Lecture Notes in 
Artificial Intelligence 1634 

Subseries of Lecture Notes in Computer Science 



Saso Dzeroski Peter Flach (Eds.)



Inductive 

Logic Programming 

9th International Workshop, ILP-99 
Bled, Slovenia, June 1999 
Proceedings 




Springer 




Lecture Notes in Artificial Intelligence 1634 

Subseries of Lecture Notes in Computer Science
Edited by J. G. Carbonell and J. Siekmann 

Lecture Notes in Computer Science 

Edited by G. Goos, J. Hartmanis and J. van Leeuwen 




Springer 

Berlin 

Heidelberg 

New York 

Barcelona 

Hong Kong 

London 

Milan 

Paris 

Singapore 

Tokyo 








Series Editors 



Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA 
Jörg Siekmann, University of Saarland, Saarbrücken, Germany



Volume Editors 
Saso Dzeroski 

Jozef Stefan Institute, Department of Intelligent Systems 
Jamova 39, SI-1000 Ljubljana, Slovenia 
E-mail: Saso.Dzeroski@ijs.si 

Peter Flach 

University of Bristol, Department of Computer Science 

Merchant Venturers Building, Woodland Road, Bristol BS8 1UB, UK

E-mail: Peter.Flach@bristol.ac.uk 



Cataloging-in-Publication data applied for 

Die Deutsche Bibliothek - CIP-Einheitsaufnahme 

Inductive logic programming : 9th international workshop ; 
proceedings / ILP-99, Bled, Slovenia, June 24 - 27, 1999. Saso 
Dzeroski ; Peter Flach (ed.). - Berlin ; Heidelberg ; New York ; 
Barcelona ; Budapest ; Hong Kong ; London ; Milan ; Paris ; 
Singapore ; Tokyo : Springer, 1999 
(Lecture notes in computer science ; Vol. 1634 : Lecture notes in 
artificial intelligence) 

ISBN 3-540-66109-3 



CR Subject Classification (1998): I.2, D.1.6

ISBN 3-540-66109-3 Springer-Verlag Berlin Heidelberg New York



This work is subject to copyright. All rights are reserved, whether the whole or part of the material is 
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, 
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication 
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, 
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law. 

(c) Springer-Verlag Berlin Heidelberg 1999 
Printed in Germany 

Typesetting: Camera-ready by author 

SPIN 10703406 06/3142 - 5 4 3 2 1 0 Printed on acid-free paper 




Foreword 



This volume contains 3 invited and 24 submitted papers presented at the Ninth 
International Workshop on Inductive Logic Programming, ILP-99. The 24 accep- 
ted papers were selected by the program committee from the 40 papers submitted 
to ILP-99. Each paper was reviewed by three referees, applying high reviewing 
standards. 

ILP-99 was held in Bled, Slovenia, 24-27 June 1999. It was collocated with 
the Sixteenth International Conference on Machine Learning, ICML-99, held 27- 
30 June 1999. On 27 June, ILP-99 and ICML-99 were given a joint invited talk 
by J. Ross Quinlan and a joint poster session where all the papers accepted at 
ILP-99 and ICML-99 were presented. The proceedings of ICML-99 (edited by 
Ivan Bratko and Saso Dzeroski) are published by Morgan Kaufmann. 

We wish to thank all the authors who submitted their papers to ILP-99, the 
program committee members and other reviewers for their help in selecting a 
high-quality program, and the invited speakers: Daphne Koller, Heikki Mannila,
and J. Ross Quinlan. Thanks are due to Tanja Urbancic and her team and Majda 
Zidanski and her team for the organizational support provided. We wish to thank 
Alfred Hofmann and Anna Kramer of Springer-Verlag for their cooperation in
publishing these proceedings. Finally, we gratefully acknowledge the financial 
support provided by the sponsors of ILP-99.

Abstract. Probabilistic models provide a sound and coherent foundation
for dealing with the noise and uncertainty encountered in most real-world
domains. Bayesian networks are a language for representing complex
probabilistic models in a compact and natural way. A Bayesian
network can be used to reason about any attribute in the domain, given
any set of observations. It can thus be used for a variety of tasks, including
prediction, explanation, and decision making. The probabilistic semantics
also gives a strong foundation for the task of learning models from
data. Techniques currently exist for learning both the structure and the
parameters, for dealing with missing data and hidden variables, and for
discovering causal structure.

One of the main limitations of Bayesian networks is that they represent 
the world in terms of a fixed set of “attributes”. Like propositional logic,
they are incapable of reasoning explicitly about entities, and thus cannot 
represent models over domains where the set of entities and the relations 
between them are not fixed in advance. As a consequence, Bayesian net- 
works are limited in their ability to model large and complex domains. 
Probabilistic relational models are a language for describing probabilistic 
models based on the significantly more expressive basis of relational lo- 
gic. They allow the domain to be represented in terms of entities, their 
properties, and the relations between them. These models represent the 
uncertainty over the properties of an entity, representing its probabilistic 
dependence both on other properties of that entity and on properties of 
related entities. They can even represent uncertainty over the relational 
structure itself. Some of the techniques for Bayesian network learning can 
be generalized to this setting, but the learning problem is far from sol- 
ved. Probabilistic relational models provide a new framework, and new 
challenges, for the endeavor of learning relational models for real-world 
domains. 



1 Relational Logic 

Relational logic has traditionally formed the basis for most large-scale knowledge 
representation systems. The advantages of relational logic in this context are 
obvious: the notions of “individuals”, their properties, and the relations between 
them provide an elegant and expressive framework for reasoning about many 
diverse domains. The use of quantification allows us to compactly represent 
general rules that can be applied in many different situations. For example,
when reasoning about genetic transmission of certain properties (e.g., genetically
transmitted diseases), we can write down general rules that hold for all people 
and many properties. 

One of the most significant gaps in the expressive power of the logical fra- 
mework, and one of the most significant barriers to its use in many real-world 
applications, is its inability to represent and reason with uncertain and noisy in- 
formation. Uncertainty is unavoidable in the real world: our information is often 
inaccurate and always incomplete, and only a few of the “rules” that we use for 
reasoning are true in all (or even most) of the possible cases. 

2 Probabilistic Models 

This limitation, which is crucial in many domains (e.g., medical diagnosis), has 
led over the last decade to the resurgence of probabilistic reasoning in AI. Proba- 
bility theory models uncertainty by assigning a probability to each of the states 
of the world that the agent considers possible. Most commonly, these states are 
the set of possible assignments of values to a set of attributes or random varia- 
bles. For example, in a medical expert system, the random variables could be 
diseases, symptoms, and predisposing factors. A probabilistic model specifies a 
joint distribution over all possible assignments of values to these variables. Thus, 
it specifies implicitly the probability of any event. 

As a consequence, unlike standard predictive models, a probability distribu- 
tion is a model of the domain as a whole, and can be used to deal with a much 
richer range of problems. It is not limited to conclusions about a prespecified 
set of attributes, but rather can be used to answer queries about any variable 
or subset of variables. Nor does it require that the values of all other variables 
be given; it applies in the presence of any evidence. For example, a probabilistic 
model can be used to predict the probability that a patient with a history of 
smoking will get cancer. As new evidence about symptoms and test results is 
obtained, Bayesian conditioning can be used to update this probability, so that 
the probability of cancer will go up if we observe heavy coughing. The same mo- 
del is used to do the predictive and the evidential reasoning. Most impressively, 
a probabilistic model can perform explaining away, a reasoning pattern that is 
very common in human reasoning, but very difficult to obtain in other formal 
frameworks. Explaining away uses evidence supporting one cause to decrease 
the probability in another, not because the two are incompatible, but simply 
because the one cause explains away the evidence, removing the support for the 
other cause. For example, the probability of cancer will go down if we observe 
high fever, which suggests bronchitis as an alternative explanation for the cough. 
The same probabilistic model supports all of these reasoning patterns, allowing 
it to be used in many different tasks. 
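
The reasoning patterns described above can be reproduced on a very small model by brute-force enumeration of the joint distribution. The following is a minimal sketch, not a model taken from the text: the four Boolean variables, the network shape (cancer and bronchitis both cause cough, only bronchitis causes fever), and every number are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical numbers, chosen only to illustrate the reasoning pattern.
P_CANCER = 0.01        # P(cancer = True)
P_BRONCHITIS = 0.05    # P(bronchitis = True)
# P(cough = True | cancer, bronchitis)
P_COUGH = {(False, False): 0.05, (False, True): 0.8,
           (True, False): 0.7, (True, True): 0.9}
# P(fever = True | bronchitis)
P_FEVER = {False: 0.02, True: 0.6}


def bernoulli(p_true, value):
    """Probability that a Boolean variable with P(True) = p_true equals value."""
    return p_true if value else 1.0 - p_true


def joint(cancer, bronchitis, cough, fever):
    """Joint probability of one complete assignment."""
    return (bernoulli(P_CANCER, cancer)
            * bernoulli(P_BRONCHITIS, bronchitis)
            * bernoulli(P_COUGH[(cancer, bronchitis)], cough)
            * bernoulli(P_FEVER[bronchitis], fever))


def prob_cancer(**evidence):
    """P(cancer = True | evidence), by summing the joint over all worlds."""
    names = ("cancer", "bronchitis", "cough", "fever")
    numerator = denominator = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # world contradicts the evidence
        p = joint(**world)
        denominator += p
        if world["cancer"]:
            numerator += p
    return numerator / denominator


prior = prob_cancer()
after_cough = prob_cancer(cough=True)                        # evidential reasoning
after_cough_and_fever = prob_cancer(cough=True, fever=True)  # explaining away
```

With these assumed numbers, observing the cough raises the probability of cancer above its prior, and additionally observing fever lowers it again, because bronchitis now accounts for the cough.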

The traditional objection to probabilistic models has been their computational
cost. A complete joint probability distribution over a set of random variables
must specify a probability for each of the exponentially many different
instantiations of the set. Thus, a naive representation is infeasible for all but the simplest
domains. Bayesian networks [...] use the underlying structure of the domain to
overcome this problem. The key insight is the locality of influence present in 
many real-world domains: a variable is directly influenced by only a few others. 
For example, smoking causes lung cancer, which can be detected by an X-ray. 
But the effect of smoking on the X-ray is an indirect one: if we know whether 
the patient has cancer, the outcome of the X-ray no longer depends on the pa- 
tient’s smoking. A Bayesian network (BN) captures this insight graphically; it 
represents the distribution as a directed acyclic graph whose nodes represent the 
random variables and whose edges represent direct dependencies. The semantics 
of such a network is that each node is conditionally independent (in the proba- 
bilistic sense) of its non-descendants given values for its parents. This allows a 
very concise representation of the joint probability distribution over these ran- 
dom variables: we associate with each node a conditional probability table, which 
specifies for each node X the probability distribution over the values of X given 
each combination of values for its parents. The conditional independence as- 
sumptions associated with the BN imply that these numbers suffice to uniquely 
determine the probability distribution over these random variables. 
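
As an illustration of this factorization, the sketch below encodes the chain Smoking -> Cancer -> X-ray with one conditional probability table per node. The probabilities are hypothetical; the point is only that the product of the local tables is a proper joint distribution, that the X-ray outcome is independent of smoking once cancer is known, and that fewer numbers are needed than for an explicit joint table.

```python
from itertools import product

# Hypothetical CPT entries for the chain Smoking -> Cancer -> XRay.
P_SMOKING = 0.3                        # P(smoking = True)
P_CANCER = {False: 0.01, True: 0.1}    # P(cancer = True | smoking)
P_XRAY = {False: 0.05, True: 0.9}      # P(xray positive | cancer)


def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(smoking, cancer, xray):
    """Product of each node's local distribution given its parents."""
    return (bernoulli(P_SMOKING, smoking)
            * bernoulli(P_CANCER[smoking], cancer)
            * bernoulli(P_XRAY[cancer], xray))


def conditional_xray(cancer, smoking):
    """P(xray = True | cancer, smoking), computed from the joint."""
    positive = joint(smoking, cancer, True)
    return positive / (positive + joint(smoking, cancer, False))


total = sum(joint(*world) for world in product([False, True], repeat=3))
bn_parameters = 1 + len(P_CANCER) + len(P_XRAY)   # numbers stored by the BN
full_table_parameters = 2 ** 3 - 1                # numbers in an explicit joint
```

Here `conditional_xray(c, True)` and `conditional_xray(c, False)` agree for either value of `c`, which is the conditional independence the graph encodes.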

Their probabilistic semantics and compact representation have also allowed 
statistical learning techniques to be used effectively in the task of learning 
Bayesian networks from data. Standard statistical parameter estimation tech- 
niques can be used for learning the parameters of a given network. Scoring 
functions such as minimum description length (MDL) and Bayesian marginal 
likelihood can be used to evaluate different candidate BN structures relative to 
a training set [...], allowing the construction of heuristic search algorithms for
learning structure from data. These techniques allow a BN structure to be
discovered from data. The learned structure can often give us insight about the
nature of the connections between the variables in the domain. Furthermore,
the graph structure can sometimes be interpreted causally [...], allowing us to
reach conclusions about the consequences of intervening (acting) in the domain.
Statistical learning techniques are also robust to the presence of missing data
and hidden variables. Techniques such as EM (expectation maximization) can be
used to deal with this issue in the context of parameter estimation [...] and have
recently even been generalized to the harder problem of structure selection [...].



3 Probabilistic Relational Models 

Over the last decade, BNs have been used with great success in a wide variety of 
real-world and research applications. Despite this success, however, BNs
are often inadequate as a representation for large and complex domains. A
BN for a given domain involves a prespecified set of random variables, whose 
relationship to each other is fixed in advance. Hence, a BN cannot be used to 
deal with domains where we might encounter several entities in a variety of 
configurations. This limitation of Bayesian networks is a direct consequence of 
the fact that they lack the concept of an “object” (or domain entity). Hence,
they cannot represent general principles about multiple similar objects which
can then be applied in multiple contexts. 

Probabilistic relational models (PRMs) extend Bayesian networks with the 
concepts of individuals, their properties, and relations between them. In a way, 
they are to Bayesian networks as relational logic is to propositional logic. A 
PRM has a coherent formal semantics in terms of probability distributions over 
sets of relational logic interpretations. Given a set of ground objects, a PRM 
specifies a probability distribution over a set of interpretations involving these 
objects (and perhaps other objects). 

3.1 Basic Language 

Our discussion of PRMs is based on the presentation in [...]. It also
accommodates and generalizes the probabilistic logic programming approach of [...].

The basic entities in a PRM are objects or domain entities. Objects in the
domain are partitioned into a set of disjoint classes $X_1, \ldots, X_n$. Each class is
associated with a set of attributes $\mathcal{A}(X_i)$. Each attribute
$A_j \in \mathcal{A}(X_i)$ takes on values in some fixed domain of values $V(A_j)$.
We use X.A to denote the attribute A of an object in class X. The other main
component of the semantics is a set of typed relations $R_1, \ldots, R_m$. The
classes and relations define the schema (or vocabulary) for our model.

It will be useful to define a directed notion of a relation, which we call a
slot. If $R(X_1, \ldots, X_k)$ is any relation, we can project R onto its i-th and j-th
arguments to obtain a binary relation $\rho(X_i, X_j)$, which we can then view as a
slot of $X_i$. For any x in $X_i$, we let $x.\rho$ denote all the elements y in $X_j$ such that
$\rho(x, y)$ holds. Objects in this set are called $\rho$-relatives of x. We can concatenate
slots to form longer slot chains $\tau = \rho_1.\cdots.\rho_m$, defined by composition of binary
relations. (Each of the $\rho_i$'s in the chain must be appropriately typed.) We use
$X.\tau$ to denote the set of objects that are $\tau$-relatives of an object in class X.

Consider, for example, a simple genetic model of the inheritance of a single 
gene that determines a person’s blood type. Each person has two copies of the 
chromosome containing this gene, one inherited from her mother, and one inhe- 
rited from her father. There is also a possibly contaminated test that attempts 
to recognize the person’s blood type. Our schema contains two classes Person 
and Blood-Test, and three relations Father, Mother, and Test-of. Attributes of 
Person are Gender, P-Chromosome (the chromosome inherited from the father), 
M-Chromosome (inherited from the mother). The attributes of Blood-Test are
Contaminated and Result. 
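
To make slots and slot chains concrete, here is a small sketch over the three relations of this schema. The object names and the facts are hypothetical, and each relation is already binary, so no projection step is needed; the helper names `relatives` and `chain_relatives` are illustrative only.

```python
# Hypothetical skeleton. Each relation is stored as a set of (x, y) pairs,
# read as "y fills this slot of x".
mother = {("carol", "alice"), ("alice", "dora")}   # Mother(child, mother)
father = {("carol", "bob")}                        # Father(child, father)
test_of = {("test1", "carol")}                     # Test-of(test, person)


def relatives(slot, x):
    """The relatives of x under one slot: all y such that slot(x, y) holds."""
    return {y for (a, y) in slot if a == x}


def chain_relatives(slots, x):
    """The relatives of x under a slot chain, by composing binary relations."""
    current = {x}
    for slot in slots:
        current = {y for z in current for y in relatives(slot, z)}
    return current


# Blood-Test.Test-of.Mother: the mother of the person a test belongs to.
tested_persons_mother = chain_relatives([test_of, mother], "test1")
# Person.Mother.Mother: the maternal grandmother.
maternal_grandmother = chain_relatives([mother, mother], "carol")
```

A chain may yield an empty set (here, `bob` has no recorded mother) or, for slots that are not single-valued, several objects.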

An instance $\mathcal{I}$ of a schema is simply a standard relational logic interpretation
of this vocabulary. For an object x and one of its attributes A, we use
$\mathcal{I}_{x.a}$ to denote the value of x.A in $\mathcal{I}$. A probabilistic
relational model (PRM) defines a probability distribution over a set of instances
of a schema. Most simply, we assume that the set of objects and the relations
between them are fixed, i.e., external to the probabilistic model. Then, the PRM
defines only a probability distribution over the attributes of the objects in the
model. More precisely, a skeleton structure $\sigma$ of a relational schema is a
partial specification of an instance of the schema. It specifies the set of objects
$O^{\sigma}(X_i)$ for each class and the relations that hold between the objects.
However, it leaves the values of the attributes unspecified. A PRM then specifies
a probability distribution over completions $\mathcal{I}$ of the skeleton.

The PRM specifies the probability distribution using the same underlying 
principles used in specifying Bayesian networks. The assumption is that each 
of the random variables in the PRM — in this case the attributes x.a of the 
individual objects x — is directly influenced by only a few others. The PRM 
therefore defines for each x.a a set of parents, which are the direct influences 
on it, and a local probabilistic model that specifies the dependence on these 
parents. However, there are two primary differences between PRMs and BNs. 
First, a PRM defines the dependency model at the class level, allowing it to be 
used for any object in the class. In some sense, it is analogous to a universally 
quantified statement. Second, the PRM explicitly uses the relational structure of 
the model, in that it allows the probabilistic model of an attribute of an object to 
depend also on attributes of related objects. The specific set of related objects 
can vary with the skeleton $\sigma$; the PRM specifies the dependency in a generic
enough way that it can apply to an arbitrary relational structure. 

Formally, a PRM consists of two components: the qualitative dependency 
structure, S, and the parameters associated with it, $\theta_S$. The dependency structure
is defined by associating with each attribute X.A a set of parents Pa(X.A).
These correspond to formal parents; they will be instantiated in different ways 
for different objects in X. Intuitively, the parents are attributes that are “direct 
influences” on X.A. 

We distinguish between two types of formal parents. The attribute X.A can 
depend on another probabilistic attribute B of X. This formal dependence in- 
duces a corresponding dependency for individual objects: for any object x in 
$O^{\sigma}(X)$, x.a will depend probabilistically on x.b. The attribute X.A can also
depend on attributes of related objects $X.\tau.B$, where $\tau$ is a slot chain. To
understand the semantics of this formal dependence for an individual object x,
recall that $x.\tau$ represents the set of objects that are $\tau$-relatives of x. Except in
cases where the slot chain is guaranteed to be single-valued, we must specify the
probabilistic dependence of x.a on the multiset $\{y.b : y \in x.\tau\}$.

The notion of aggregation from database theory gives us precisely the right 
tool to address this issue; i.e., x.a will depend probabilistically on some aggre- 
gate property of this multiset. There are many natural and useful notions of 
aggregation: the mode of the set (most frequently occurring value); mean value 
of the set (if values are numerical); median, maximum, or minimum (if values 
are ordered); cardinality of the set; etc. More formally, our language allows a 
notion of an aggregate $\gamma$; $\gamma$ takes a multiset of values of some ground type, and
returns a summary of it. The type of the aggregate can be the same as that of
its arguments. However, we allow other types as well, e.g., an aggregate that
reports the size of the multiset. We allow X.A to have as a parent $\gamma(X.\tau.B)$; the
semantics is that for any $x \in X$, x.a will depend on the value of $\gamma(x.\tau.b)$. We
define $V(\gamma(X.\tau.B))$ in the obvious way.
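
The aggregates listed above can be sketched as ordinary functions over a list of values (a list, unlike a set, keeps duplicates and so behaves as a multiset). The dictionary `AGGREGATES`, the helper `aggregate_parent`, and the contract data are all hypothetical names and numbers introduced for this example.

```python
from collections import Counter
from statistics import mean, median


def mode(values):
    """Most frequently occurring value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]


# Aggregates named in the text, plus a sum; each maps a multiset to a summary.
AGGREGATES = {
    "mode": mode,
    "mean": mean,
    "median": median,
    "max": max,
    "min": min,
    "sum": sum,
    "cardinality": len,   # result type differs from the argument type
}


def aggregate_parent(aggregate, attribute, related_objects, values):
    """Summarise the multiset of attribute values of the related objects."""
    multiset = [values[(y, attribute)] for y in related_objects]
    return AGGREGATES[aggregate](multiset)


# Hypothetical data: the contracts related to one professor.
amounts = {("c1", "Amount"): 100, ("c2", "Amount"): 250, ("c3", "Amount"): 100}
contracts = ["c1", "c2", "c3"]
total_funding = aggregate_parent("sum", "Amount", contracts, amounts)
```

The value returned by the aggregate, rather than the multiset itself, is what the child attribute is conditioned on, so the local model keeps a fixed number of parents however many related objects there are. Most of these aggregates are undefined on an empty multiset, which a fuller implementation would have to handle.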

Given a set of parents Pa(X.A) for X.A, we can define a local probability
model for X.A. We associate X.A with a conditional probability distribution that
specifies $P(X.A \mid Pa(X.A))$. Let U be the set of parents of X.A. Each of these
parents $U_i$, whether a simple attribute in the same relation or an aggregate of
a set of $\tau$-relatives, has a set of values $V(U_i)$ in some ground type. For each
tuple of values $u \in V(U)$, we specify a distribution $P(X.A \mid u)$ over $V(X.A)$.
This entire set of parameters comprises $\theta_S$.



3.2 PRM Semantics 

Given any skeleton, we have a set of random variables of interest: the attribu- 
tes of the objects in the skeleton. We want the PRM to specify a probability 
distribution over the possible assignments of values to these random variables. 
In order to guarantee that the local probability models associated with these 
variables define a coherent distribution, we must ensure that our probabilistic 
dependencies are acyclic, so that a random variable does not depend, directly 
or indirectly, on its own value. Consider the parents of an attribute X.A. When
X.B is a parent of X.A, we define an edge $x.b \to_\sigma x.a$; when $\gamma(X.\tau.B)$ is a
parent of X.A and $y \in x.\tau$, we define an edge $y.b \to_\sigma x.a$. We say that a
dependency structure S is acyclic relative to a skeleton $\sigma$ if the directed graph
defined by $\to_\sigma$ over the variables x.a is acyclic. In this case, we are guaranteed
that the PRM defines a coherent probabilistic model over complete instantiations
$\mathcal{I}$ consistent with $\sigma$:



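$$P(\mathcal{I} \mid \sigma, S, \theta_S) = \prod_{X_i} \prod_{A \in \mathcal{A}(X_i)} \prod_{x \in O^{\sigma}(X_i)} P(\mathcal{I}_{x.a} \mid \mathcal{I}_{Pa(x.a)}) \tag{1}$$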
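
Equation (1) multiplies one local factor per attribute of every object in the skeleton, with all objects of a class sharing the same class-level parameters. The sketch below evaluates that product for a deliberately simplified variant of the blood-type example: a single hypothetical attribute BloodType stands in for the two chromosome attributes, and every number is an assumption.

```python
import math

# Class-level parameters (hypothetical numbers).
PRIOR_BLOOD_TYPE = {"A": 0.4, "B": 0.1, "AB": 0.05, "O": 0.45}
P_CONTAMINATED = {True: 0.1, False: 0.9}


def p_result(result, contaminated, blood_type):
    """P(Result | Contaminated, Test-of.BloodType), shared by all tests."""
    if contaminated:
        return 0.25                       # uniform over the four types
    return 0.97 if result == blood_type else 0.01


# Skeleton: the objects of each class and the (single-valued) Test-of slot.
skeleton = {
    "Person": ["ann", "bo"],
    "BloodTest": ["t1", "t2"],
    "test_of": {"t1": "ann", "t2": "ann"},
}


def log_joint(instance, skeleton):
    """Log of Equation (1): one factor per attribute of every object."""
    total = 0.0
    for x in skeleton["Person"]:
        total += math.log(PRIOR_BLOOD_TYPE[instance[(x, "BloodType")]])
    for t in skeleton["BloodTest"]:
        contaminated = instance[(t, "Contaminated")]
        person = skeleton["test_of"][t]   # follow the slot to the parent object
        total += math.log(P_CONTAMINATED[contaminated])
        total += math.log(p_result(instance[(t, "Result")], contaminated,
                                   instance[(person, "BloodType")]))
    return total


# One completion of the skeleton.
instance = {
    ("ann", "BloodType"): "A", ("bo", "BloodType"): "O",
    ("t1", "Contaminated"): False, ("t1", "Result"): "A",
    ("t2", "Contaminated"): True, ("t2", "Result"): "B",
}
probability = math.exp(log_joint(instance, skeleton))
```

Changing the skeleton (more people, more tests, a different Test-of assignment) changes which factors appear in the product, but not the parameters themselves.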

This construction allows us to check whether a dependency structure S is 
acyclic relative to a fixed skeleton $\sigma$. However, we often want stronger guarantees:
we want to ensure that our dependency structure is acyclic for any skeleton that 
we are likely to encounter. How do we guarantee acyclicity for an arbitrary 
skeleton? A simple approach is to ensure that dependencies among attributes 
respect some order, i.e., they are stratified. More precisely, we say that X.A 
directly depends on Y.B if either X = Y and X.B is a parent of X.A, or $\gamma(X.\tau.B)$
is a parent of X.A and the $\tau$-relatives of X are of class Y. We require that X.A
directly depends only on attributes that precede it in the order. 

While this simple approach clearly ensures acyclicity, it is too limited to cover 
many important cases. In our genetic model, for example, the genotype of a person
depends on the genotype of her parents; thus, we have Person.P-Chromosome
depending directly on Person.P-Chromosome, which clearly violates the requirements
of our simple approach. In this model, the apparent cyclicity at the
attribute level is resolved at the level of individual objects, as a person cannot
be his/her own ancestor. That is, the resolution of acyclicity relies on some prior
knowledge that we have about the domain. We want to allow the user to give
us information such as this, so that we can make stronger guarantees about
acyclicity.

We allow the user to assert that certain slots $\mathcal{R}_{ga} = \{\rho_1, \ldots, \rho_k\}$ are
guaranteed acyclic; i.e., we are guaranteed that there is a partial ordering
$\prec_{ga}$ such that if y is a $\rho$-relative for some $\rho \in \mathcal{R}_{ga}$ of x, then
$y \prec_{ga} x$. We say that $\tau$ is guaranteed acyclic if each of its component
slots $\rho$ is guaranteed acyclic. This prior knowledge allows us to guarantee the
legality of certain dependency models [...].
We start by building a graph that describes the direct dependencies between the
attributes. In this graph, we have a yellow edge $X.B \to X.A$ if X.B is a parent
of X.A. If $\gamma(X.\tau.B)$ is a parent of X.A, where the $\tau$-relatives of X are of
class Y, we have an edge $Y.B \to X.A$ which is
green if $\tau$ is guaranteed acyclic and red otherwise. (Note that there might be
several edges, of different colors, between two attributes.) The intuition is that
dependency along green edges relates objects that are ordered by an acyclic order.
Thus these edges by themselves or combined with intra-object dependencies
(yellow edges) cannot cause a cyclic dependency. We must take care with other
dependencies, for which we do not have prior knowledge, as these might form a
cycle. This intuition suggests the following definition: A (colored) dependency
graph is stratified if every cycle in the graph contains at least one green edge
and no red edges. It can be shown that if the colored dependency graph of S and
$\mathcal{R}_{ga}$ is stratified, then for any skeleton $\sigma$ for which the slots in
$\mathcal{R}_{ga}$ are jointly acyclic, S defines a coherent probability distribution over
assignments to $\sigma$.

This notion of stratification generalizes the two special cases we considered 
above. When we do not have any guaranteed acyclic relations, all the edges in the 
dependency graph are colored either yellow or red. Thus, the graph is stratified 
if and only if it is acyclic. In the genetics example, all the relations would be in 
$\mathcal{R}_{ga}$. Thus, it suffices to check that dependencies within objects (yellow edges)
are acyclic. Checking for stratification of a colored graph can be done, using 
standard graph algorithms, in time linear in the number of edges in the graph. 
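
The definition of stratification can be checked directly with reachability queries. The sketch below is a simple quadratic-time version written for clarity; it is not the linear-time procedure mentioned above. It relies on two observations: an edge lies on a cycle exactly when its target can reach its source, and a cycle without green edges exists exactly when some non-green edge can be closed using non-green edges only. The edge list for the genetics example is an assumed encoding of the dependencies described in the text.

```python
from collections import defaultdict


def reachable(adjacency, start, goal):
    """True if goal can be reached from start along directed edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency[node])
    return False


def is_stratified(edges):
    """edges: (source, target, color) with color in {yellow, green, red}."""
    edges = list(edges)
    all_edges = defaultdict(list)
    non_green = defaultdict(list)
    for source, target, color in edges:
        all_edges[source].append(target)
        if color != "green":
            non_green[source].append(target)
    for source, target, color in edges:
        # No cycle may contain a red edge.
        if color == "red" and reachable(all_edges, target, source):
            return False
        # Every cycle must contain a green edge.
        if color != "green" and reachable(non_green, target, source):
            return False
    return True


def genetics_edges(parent_slot_color):
    """Attribute-level dependency graph of the blood-type model (assumed)."""
    chromosomes = ["Person.M-Chromosome", "Person.P-Chromosome"]
    # Each chromosome of a person depends on both chromosomes of a parent.
    edges = [(src, dst, parent_slot_color)
             for src in chromosomes for dst in chromosomes]
    edges.append(("BloodTest.Contaminated", "BloodTest.Result", "yellow"))
    edges += [(src, "BloodTest.Result", "green") for src in chromosomes]
    return edges
```

With Mother and Father declared guaranteed acyclic, `is_stratified(genetics_edges("green"))` holds; without that declaration the same edges are red and the check fails, matching the discussion of the genetic model.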



4 Inference 

PRMs are significantly more expressive than Bayesian networks. Performing in- 
ference on a BN — answering queries about one or more random variables given 
evidence about others — is already an expensive operation (the task is NP-hard [...]).
It is natural to suspect that the additional expressive power of PRMs
might make the inference problem significantly harder. Surprisingly, we have
strong evidence that the additional expressive power helps, rather than hinders,
the inference task.

A PRM makes explicit two types of structure which are often present in a 
BN but only implicitly: encapsulation of influence, and model reuse. As we will 
see, making this structure explicit allows the inference algorithm to exploit it, 
and often to achieve better performance. 

The random variables in a BN are typically induced by more than one “ob- 
ject” present in the model; but having no notion of an “object”, the BN has 
no way of making that knowledge explicit. The PRM does make this knowledge
explicit, and it turns out to be extremely useful. In practice, it is often the case
that an attribute of an object is influenced mostly by other attributes of the 
same object; there are relatively few inter-object dependencies. In general, BN 
inference algorithms [...] use a “divide and conquer” approach, doing localized
computation over the graphical structure of the BN and putting the results to- 
gether. By making the locality structure explicit, we can give our BN algorithm 
guidance on how to perform this partitioning effectively. 

The other useful structure is induced by the class structure. A PRM makes 
explicit the fact that several “chunks” of the model are derived from the same 
probabilistic model — the class model. If we have multiple objects of the same 
class, without any evidence telling us that they are necessarily different, we can 
often do a single inference subroutine for one and reuse it for the others. In large 
structured domains with multiple objects, the savings can be considerable. 

In [11], we present experiments with this approach on the challenging real-world
domain of military situation assessment. We show that by using the structure
of the PRM, we can gain orders of magnitude savings over the straightforward
BN inference algorithms. In particular, we constructed a PRM for this
domain, and considered its behavior over a skeleton with a large number of 
objects. The BN over the attributes of these objects has over 5500 random va- 
riables. A standard BN inference algorithm, applied to this BN, took over twenty 
minutes to answer a query. The PRM inference algorithm that takes advantage 
of the additional structure took nine seconds. 



5 Learning 

One of the main advantages of probabilistic models is that the same representation
language accommodates both reasoning and learning. As discussed in the
introduction, the underlying probabilistic framework provides many advantages, 
including the availability of: good statistical parameter estimation techniques; 
techniques for dealing with missing data; and well-motivated scoring functions 
for structure selection. 

PRMs are built on the same foundations as BNs. They share the same under- 
lying probabilistic semantics and the use of local independence models to allow 
compact representation of complex distributions. This similarity allows many of 
the BN learning techniques developed over the last few years to be extended 
to PRMs. As a first step [...], we have shown how to apply EM (Expectation
Maximization [2]) to the problem of learning parameters $\theta_S$ for a PRM whose
dependence structure S is known. More recently [...], we have attacked the more
challenging problem of learning the dependency structure S from data. Given 
a relational database as a training set, our algorithm discovers probabilistic de- 
pendencies between attributes of related entities. For example, in a database of 
movies and actors, it learned that the Role-Type played by an actor in a movie 
depends on the Gender of the actor and the Genre of the movie. 
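
For the simplest case, a known dependency structure and a fully observed database, the class-level parameters can be estimated by counting, pooling all objects of the class. The sketch below does this for the dependency mentioned above. It is not the EM procedure or the structure-learning algorithm referred to in the text, and the toy database is entirely made up.

```python
from collections import Counter, defaultdict

# Hypothetical toy database; every value below is invented for illustration.
actors = {"a1": {"Gender": "F"}, "a2": {"Gender": "M"}}
movies = {"m1": {"Genre": "action"}, "m2": {"Genre": "drama"}}
roles = [  # each role has slots Actor and Movie and an attribute Role-Type
    {"Actor": "a1", "Movie": "m1", "Role-Type": "hero"},
    {"Actor": "a2", "Movie": "m1", "Role-Type": "villain"},
    {"Actor": "a2", "Movie": "m1", "Role-Type": "hero"},
    {"Actor": "a1", "Movie": "m2", "Role-Type": "lead"},
    {"Actor": "a2", "Movie": "m1", "Role-Type": "villain"},
]


def estimate_cpd(roles, actors, movies):
    """Maximum-likelihood P(Role-Type | Actor.Gender, Movie.Genre)."""
    counts = defaultdict(Counter)
    for role in roles:
        # Follow the two slots to read the parent values of this role.
        parents = (actors[role["Actor"]]["Gender"],
                   movies[role["Movie"]]["Genre"])
        counts[parents][role["Role-Type"]] += 1
    return {
        parents: {value: n / sum(c.values()) for value, n in c.items()}
        for parents, c in counts.items()
    }


cpd = estimate_cpd(roles, actors, movies)
```

Because the counts are pooled over every role in the database, a single table is learned for the whole class rather than one per object. Parent combinations never seen in the data simply have no entry here; a practical estimator would add smoothing or a prior.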


6 Structural Uncertainty

So far, we have assumed that the skeleton is external to the probabilistic model. 
In fact, our framework can be extended to accommodate a generative model 
over skeletons as well as their properties. We now provide a brief sketch of this 
extension, which is fully described in [7,12].

Restricting attention to binary relations, we define a probability distribution 
over the relational structure relating to slots of a given class. For example, let us 
assume that the class Professor has a slot Student. We can introduce an attribute 
into the class called num(Student), which takes on integer values in some finite 
range 0, ..., k. The value of this attribute represents the number of students
that the professor has. Now, consider a particular professor in this class. Each 
choice of value for this attribute is associated with a different skeleton, where the 
professor has a different set of students. We can specify a probability distribution 
over the values of this attribute, which induces a probabilistic model over the 
skeletons of this schema. 

Our framework allows this uncertainty over structure to interact with the 
model in interesting ways. In particular, we view the attribute num(Student) as
a random variable, and it can be used in the same ways as any other random 
variable: it can depend probabilistically on the values of other attributes in 
the model (both of the associated object and of related ones) and can influence 
probabilistically the values of attributes in the model. For example, the attribute 
num(Student) can depend on the amount of funding (e.g., via an aggregate
Sum over Professor.Contract.Amount) and can influence the professor's stress level.

Note that the semantics of this extension are significantly more subtle. A set 
of objects and relations between them no longer specifies a complete schema. 
For example, assume we are given a single professor and one student for her. 
A priori, the PRM may give positive probability to instances $\mathcal{I}$ where the professor
has zero students, one student, two students, or more. Instances where
the professor has zero students are inconsistent with our evidence that the pro- 
fessor has at least one. We must therefore use Bayesian conditioning to update 
the distribution to accommodate this new fact. There are also instances where 
the professor has more than one student. In general, these are not inconsistent 
with our observations. They will merely contain new “generic” objects of the 
Student class, about which nothing is known except that they are students of 
this particular professor. 
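
The conditioning step described above can be illustrated with a minimal sketch. It assumes, for simplicity, an unconditional prior over num(Student) (in a PRM this attribute may depend on other attributes); the function name and the numbers in the prior are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def condition_on_observed_students(prior, observed):
    """Condition a prior over num(Student) on having observed `observed` students.

    Values smaller than `observed` are inconsistent with the evidence and are
    removed; the remaining probabilities are renormalized.
    """
    consistent = {n: p for n, p in prior.items() if n >= observed}
    total = sum(consistent.values())
    if total == 0:
        raise ValueError("the evidence has zero prior probability")
    return {n: p / total for n, p in consistent.items()}


# Hypothetical prior over num(Student) in the range 0..3 (illustrative only).
prior = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}

# Evidence: one known student of this professor.
posterior = condition_on_observed_students(prior, observed=1)

# Instances with n > 1 students contain n - 1 additional "generic" Student objects.
generic_students = {n: n - 1 for n in posterior}
```

Instances with more students than observed keep positive probability; they simply contain the extra generic objects counted in `generic_students`.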

In [7, 12] we discuss some other forms of representing uncertainty over the 
relational structure of the model. We also show how some of the ideas for gua- 
ranteeing acyclic model structure can be extended to this richer setting. 

7 Conclusions and Further Directions 

PRMs provide a formal foundation for a rich class of models that integrates the 
expressive power of Bayesian networks and of relational models. On the one hand, 
they allow Bayesian networks to scale up to significantly more complex domains. 
On the other, they provide a coherent and robust treatment of uncertainty, and 
can therefore deal with domains where deterministic first-order frameworks could 
not reasonably be applied. As we showed in PRMs allow existing knowledge 
bases to be annotated with probabilistic models, greatly increasing their ability 
to express meaningful knowledge in real-world applications. 

By carefully designing the language features of PRMs and their semantics, 
we have managed to achieve this increase in expressive power without losing the 
properties that made Bayesian networks so attractive: the sound probabilistic 
semantics, the ability to answer an entire range of queries, and the existence 
of effective inference and learning algorithms that exploit the structure. Indeed, 
we have shown that by making this higher level structure explicit, we allow 
inference algorithms to exploit it and achieve even better performance than on 
the equivalent BN. 

PRMs provide a new coherent framework for combining logical and probabilistic 
approaches. They therefore raise many new challenges. Some of the 
most important of these are in the area of discovering structure from complex 
data; some important tasks include: automated discovery of hidden variables, 
discovering causal structure, automatically learning a class hierarchy, and more. 
Solving these problems and others will almost certainly require an integration 
of techniques from probabilistic learning and from inductive logic programming. 
Abstract. Data mining aims at trying to locate interesting patterns or 
regularities from large masses of data. Data mining can be viewed as part 
of data analysis or knowledge management. In data analysis tasks one 
can see a continuous spectrum of information needs, starting from very 
simple database queries (“what is the address of customer NN”), moving 
to more complex aggregate information (“what are the sales by product 
groups and regions”) to data mining type of queries (“give me interesting 
trends on sales”). This suggests that it is useful to view data mining as 
querying the theory of the database, i.e., the set of sentences that are true 
in the database. An inductive database is a database that conceptually 
contains in addition to normal data also all the generalizations of the data 
from a given language of descriptors. Inductive databases can be viewed 
as analogues to deductive databases; deductive databases conceptually 
contain all the facts derivable from the data and the rules. In this talk 
I describe a formal framework for inductive databases and discuss some 
theoretical and practical problems in the area. 
1 Introduction 

This talk will revisit some important elements of ML lore, focusing on the design 
of classifier-learning systems. Within ML, the key desiderata for such systems 
have been predictive accuracy and interpretability. Although Provost, Fawcett 
and Kohavi (1998) have shown that accuracy alone is a poor metric for comparing 
learning systems, it is still important in most real-world applications. The quest 
for intelligibility, stressed from earliest days by Michie, Michalski and others, is 
now crucial for those data-mining applications whose main objective is insight. 
Scalability is also vital if the learning system is to be capable of analyzing the 
burgeoning numbers of instances and attributes in commercial and scientific 
databases. 

The design of classifier-learning systems is guided by several perceived truths, 
including: 

— Learning involves generalization. 

— Most ML models are structures (exceptions being Naive Bayes and some 
instance-based models). 

— A model is an element in a lattice of possible models, so search is unavoidable. 

— General-to-specific and specific-to-general search are both important - most 
systems will use both. 

— Similar instances probably belong to the same class (the similarity assumption). 

— Learned models should not blindly maximize accuracy on the training data, 
but should balance resubstitution accuracy against generality, simplicity, in- 
terpretability, and search parsimony. 

— Error can be decomposed into components arising from bias and variance, 
thereby helping to understand the behavior of learning systems. 

We might all accept these, but that does not mean that their consequences and 
possibilities have been thoroughly explored. I would like to illustrate this with 
four mini-topics relevant to the design of classifier-learning systems: creating 
structure, removing excessive structure, using search effectively, and finding an 
appropriate tradeoff between accuracy and complexity. 

* This extended abstract also appears in the Proceedings of the Sixteenth International 
Conference on Machine Learning, published by Morgan Kaufmann. 
2 Structured Models 

Two commonly used approaches in ML are divide-and-conquer, which produces 
tree-structured models, and cover, which generally yields sequence models 
such as rulesets and Horn Clause programs. Both divide-and-conquer and cover 
can transform any parametric model class into a structured model class, reducing 
its bias but at the same time increasing its variance. 

In propositional learning, model components such as leaves or rules are as- 
sociated with contiguous regions in the instance space, and can thus be justified 
under the similarity assumption. But “closeness” in the instance space is not the 
only way to measure similarity; two instances can also be regarded as similar if 
they are mapped to the same output by some function. For example, consider 
the application of divide-and-conquer to Naive Bayes. Whereas Kohavi’s (1996) 
NBTree generates a tree-structured model based on the familiar regional parti- 
tion of the instance space, an alternative divides the instances on the basis of 
the class predicted by the current Naive Bayes model. 

3 Right-Sizing Models 

The downside of structured models is their potential to overfit the training data. 
In ML we generally accept that it is better to construct a model and then post- 
process the structure to remove unhelpful parts. The methods used to decide 
whether structure is warranted can be grouped into 

— syntactic heuristics such as MML/MDL, cost-complexity pruning, and struc- 
tural risk minimization, and 

— non-syntactic heuristics such as cross-validation, reduced error pruning, and 
pessimistic pruning. 

MML/MDL is appealing - it has a firm basis in theory, does not require 
construction of additional models, uses all the training data, and avoids strong 
assumptions. However, its performance depends critically on the way in which the 
conceptual messages are encoded. For example, MDL fares poorly in evaluations 
conducted by Kearns, Mansour, Ng, and Ron (1997). Wallace, the founder of 
MML, has questioned the coding approach used by Kearns et al. and shown that 
an alternative scheme produces much better results. 

4 Computation-Intensive Learning 

Structured models imply vast model spaces, so early ML algorithms used greedy 
heuristics in order to make the search tractable. As cycles have become expo- 
nentially cheaper, researchers have experimented with more thorough search. 
The results have been rather unexpected - theories have been found that are 
both simpler and fit the data better, but predictive accuracy has often suffered. 
Examples are provided by Webb’s (1993) work with finding optimal rules, and 
Cameron-Jones’ and my (1995) experiments varying the amount of search. (In 
hindsight, we might have expected this, since the phenomenon of overtraining is 
well-known in Neural Networks.) 

Domingos (1998) has recently developed a “process-oriented” approach that 
relates true error rate directly to search complexity, much as MML/MDL relates 
it to theory complexity. An ideal formalism would take into account resubstitu- 
tion error, theory complexity, and search complexity. 



5 Boosting 

Freund and Schapire’s (1996) boosting has been acclaimed by ML researchers 
and statisticians alike. All empirical studies have found that boosting usually 
increases predictive accuracy, often dramatically. 

On the other hand, boosted classifiers can become totally opaque. For in- 
stance, it is possible to construct a single decision tree that is exactly equivalent 
to a boosted sequence of trees, but the single tree can become enormous even 
when the sequence is as short as three. 

Is there a middle ground? I will discuss a fledgling technique for using boosting 
to improve the predictive accuracy of a rule-based classifier without increasing 
its complexity. A more ambitious goal would be to find a comprehensible 
approximation to a boosted classifier. 

6 Looking Ahead 

ML is a mature field with unique features that distinguish it from related 
disciplines such as Statistics. My list of its most valuable assets includes: 

— A head start in relational learning (one decade at least). 

— A substantial body of theory (e.g., PAC-learnability, weak learnability, 
identification in the limit). 

— Links between theory and system development (e.g., boosting). 

— An understanding of search and heuristics (a legacy of ML’s roots in AI). 

— A lack of hang-ups about optimality. 

If I were given one wish regarding future directions in ML, it would be that 
we pay more attention to interpretability. For example, I would like to see: 

— A model of interpretability (or, alternatively, of opacity). 

— Approaches for constructing understandable approximations to complex 
models. 

— Ways of reformulating models to improve their intelligibility. 



Acknowledgements Thanks to Saso Dzeroski and Ivan Bratko for comments 
on a draft of this abstract. 
Abstract. Our aim is to construct a perfect (i.e. minimal and optimal) 
ILP refinement operator for hypotheses spaces bounded below by a 
most specific clause and subject to syntactical restrictions in the form of 
input/output variable declarations (like in Progol). Since unfortunately 
no such optimal refinement operators exist, we settle for a weaker form 
of optimality and introduce an associated weaker form of subsumption 
which exactly captures a first incompleteness of Progol’s refinement 
operator. We argue that this sort of incompleteness is not a drawback, as it 
is justified by the examples and the MDL heuristic. 

A second type of incompleteness of Progol (due to subtle interactions 
between the requirements of non-redundancy, completeness and the 
variable dependencies) is more problematic, since it may sometimes lead 
to unpredictable results. We remove this incompleteness by constructing 
a sequence of increasingly more complex refinement operators which 
eventually produces the first (weakly) perfect refinement operator for a 
Progol-like ILP system. 



1 Introduction 

Learning logic programs from examples in Inductive Logic Programming (ILP) 
involves traversing large spaces of hypotheses. Various heuristics, such as infor- 
mation gain or example coverage, can be used to guide this search. A simple 
search algorithm (even a complete and non-redundant one) would not do, unless 
it allows for a flexible traversal of the search space, based on an external heu- 
ristic. Refinement operators allow us to decouple the heuristic from the search 
algorithm. 

In order not to miss solutions, the refinement operator should be (weakly) 
complete. In order not to revisit already visited portions of the search space it 
should also be non-redundant. Such weakly complete non-redundant refinement 
operators are called optimal. 

Various top-down ILP systems set a lower bound (usually called most spe- 
cific clause (MSC) or saturant) on the hypotheses space in order to limit its 
size. Syntactical restrictions in the form of mode declarations on the predicate 
arguments are also used as a declarative bias. 

Devising an optimal refinement operator for a hypotheses space bounded 
below by a MSC in the presence of input/output (±) variable dependencies is 
not only a challenging issue given the subtle interactions of the above-mentioned 
features, but also a practically important one since refinement operators 
represent the core of an ILP system. 

The Progol refinement operator, for example, is incomplete in two ways. 
First, it is incomplete w.r.t. ordinary subsumption since each literal from the 
MSC has at most one corresponding literal in each hypothesis (only variabilized 
subsets of the MSC are considered as hypotheses). Rather than being a drawb- 
ack, we argue that this incompleteness is exactly the sort of behavior we would 
expect. In order to make this observation precise, we introduce a weaker form 
of subsumption under which the refinement operator is complete. 

The second type of incompleteness (see also example 30 of P]) is more proble- 
matic since it cannot be characterized in a clean way and since it depends on the 
ordering of mode declarations and examples. In order to achieve non-redundancy 
and at the same time obey the ±variable dependencies imposed by the mode 
declarations, Progol scans the MSC left-to-right and non-deterministically 
decides for each literal whether to include it in (or exclude it from) the current 
hypothesis. A variabilized version of the corresponding MSC literal is included in 
the current hypothesis only if all its input (+) variables are preceded by suitable 
output (−) variables. This approach is incomplete since it would reject a literal 
$l_i$ that obtains a +variable from a literal $l_j$ that will be considered only later: 

$\ldots,\ l_i(\ldots, +X, \ldots),\ \ldots,\ l_j(\ldots, -X, \ldots),\ \ldots$ 

although $l_j, l_i$ would constitute a valid hypothesis. 

Note that a simple idea like reordering the literals in the MSC would not 
help in general, since the MSC may exhibit cyclic variable dependencies while 
still admitting acyclic subsets. 

The following example illustrates the above-mentioned incompleteness: 



:- modeh(1, p(+any, +t))? 
:- modeb(1, f(-any, -t))? 
:- modeb(1, g(+any, +t))? 
:- modeb(1, h(-any, -t))? 

p(1,a). p(2,a). p(3,a). 
:- p(4,a). 

f(1,b). f(2,b). f(3,b). f(4,b). 
g(1,a). g(2,c). g(3,c). 
h(1,a). h(2,c). h(3,c). h(4,a). 



As long as the mode declaration for g precedes that of h, Progol will produce 
a MSC p(A,B) :- f(A,C), g(A,B), h(A,B) in which the g literal cannot 
obtain its +variables from h since the former precedes the latter in the MSC. 
Thus Progol will miss the solution p(A,B) :- h(A,C), g(A,C) which can be 
found only if we move the mode declaration for h before that of g. This type of 
incompleteness may sometimes lead to unpredictable results and a reordering of 
the mode declarations will not always be helpful in solving the problem. 
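
The left-to-right linking condition behind this example can be sketched as a small check. This is only an illustration of the condition stated above, not Progol's actual implementation; the function name and the encoding of a literal as a predicate name plus (mode, variable) pairs are assumptions made for the sketch.

```python
def io_consistent(head_inputs, body):
    """Check the left-to-right input/output condition on a clause body.

    `body` is a list of (predicate, [(mode, variable), ...]) pairs. A literal
    is acceptable only if each of its '+' variables is already available,
    either from the head's '+' arguments or from a '-' argument of an
    earlier body literal.
    """
    available = set(head_inputs)
    for _pred, args in body:
        if any(mode == "+" and var not in available for mode, var in args):
            return False
        available.update(var for mode, var in args if mode == "-")
    return True


# Variabilized literals of the missed solution p(A,B) :- h(A,C), g(A,C),
# annotated with the modes g(+any, +t) and h(-any, -t).
g = ("g", [("+", "A"), ("+", "C")])
h = ("h", [("-", "A"), ("-", "C")])

h_before_g = io_consistent({"A", "B"}, [h, g])  # h supplies C before g needs it
g_before_h = io_consistent({"A", "B"}, [g, h])  # g is scanned first, C unavailable
```

With g scanned before h, as in the MSC above, the variabilization g(A,C) is rejected because C has not yet been produced, which is why the solution is only reachable after reordering the mode declarations.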

Although Progol’s algorithm for constructing the MSC makes sure that each 
+variable occurrence is preceded by a corresponding −variable, there may be 
several other −variable occurrences in literals ordered after the literal containing 
the +variable. These potential “future links” will be missed by Progol. In cases 
of many such “future links”, the probability of the correct ordering being in the 
search space is exponentially low in the number of “future links”. 

For a very small number of literals in the body and/or a small variable 
depth, this incompleteness may not be too severe, especially if we order the mode 
declarations appropriately. The problem becomes important for hypotheses with 
a larger number of literals. 

2 Refinement Operators 

In order to be able to guide the search in the space of hypotheses by means of 
an external heuristic, we need to construct a refinement operator. For a top- 
down search, we shall deal with a downward refinement operator, i.e. one that 
constructs clause specializations. In the following we will consider refinement 
operators w.r.t. the subsumption ordering between clauses. 

Definition 1. Clause $C$ subsumes clause $D$, $C \succeq D$, iff there exists a 
substitution $\theta$ such that $C\theta \subseteq D$ (the clauses being viewed as sets of literals). 
$C$ properly subsumes $D$, $C \succ D$, iff $C \succeq D$ and $D \not\succeq C$. $C$ and $D$ are 
subsume-equivalent, $C \sim D$, iff $C \succeq D$ and $D \succeq C$. 
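
As a minimal sketch of Definition 1, subsumption between function-free clauses can be tested by a backtracking search for a suitable substitution. The representation (a literal as a predicate name with a tuple of terms, variables as capitalized strings) and the function names are assumptions of this sketch; the example clauses are illustrative.

```python
def is_var(term):
    """Convention for this sketch: variables start with an uppercase letter."""
    return term[:1].isupper()


def match(literal, target, theta):
    """Extend theta so that literal*theta equals target, or return None."""
    (pred, args), (tpred, targs) = literal, target
    if pred != tpred or len(args) != len(targs):
        return None
    theta = dict(theta)
    for a, b in zip(args, targs):
        if is_var(a):
            if theta.setdefault(a, b) != b:
                return None
        elif a != b:
            return None
    return theta


def subsumes(c, d, theta=None):
    """True iff some substitution theta maps every literal of c into d."""
    theta = {} if theta is None else theta
    if not c:
        return True
    for target in d:
        extended = match(c[0], target, theta)
        if extended is not None and subsumes(c[1:], d, extended):
            return True
    return False


def properly_subsumes(c, d):
    return subsumes(c, d) and not subsumes(d, c)


# Illustrative clause bodies over a single binary predicate p.
one_lit_xx = [("p", ("X", "X"))]
one_lit_xy = [("p", ("X", "Y"))]
two_lits = [("p", ("X", "Y")), ("p", ("Y", "X"))]
```

Here `properly_subsumes(two_lits, one_lit_xx)` holds, a small instance of the counter-intuitive situation in which a clause with more literals is strictly more general than one with fewer.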

Lemma 2. 0/ For a most general literal L w.r.t. clause C (one with new and 
distinct variables), C properly subsumes C ∪ {L} iff L is incompatible with all 
literals in C (i.e. it has a different predicate symbol). 

The somewhat subtle and counter-intuitive properties of subsumption are 
due to the incompatibility of the induced subsumption-equivalence relation ~ 
with the elementary operations of a refinement operator, such as adding a literal 
or performing a substitution. 

Remark. Note that not all reduced specializations D of a reduced clause C 
can be obtained just by adding one literal or by making a simple substitution 
{X/Y}. It may be necessary to add several literals and make several simple 
substitutions in one refinement step, since each of these elementary operations 
applied separately would just produce a clause that is subsume-equivalent with C. 

Definition 3. $\rho$ is a (downward) refinement operator iff for all clauses $C$, $\rho$ 
produces only specializations of $C$: $\rho(C) \subseteq \{D \mid C \succeq D\}$. 

Definition 4. A refinement operator $\rho$ is called 

— (locally) finite iff $\rho(C)$ is finite and computable for all $C$. 

— proper iff for all $C$, $\rho(C)$ contains no $D \sim C$. 

— complete iff for all $C$ and $D$, $C \succ D \Rightarrow \exists E \in \rho^*(C)$ such that $E \sim D$. 

— weakly complete iff $\rho^*(\Box)$ = the entire set of clauses. 

— non-redundant iff for all $C_1, C_2$ and $D$, $D \in \rho^*(C_1) \cap \rho^*(C_2) \Rightarrow C_1 \in \rho^*(C_2)$ 
or $C_2 \in \rho^*(C_1)$. 

— ideal iff it is locally finite, proper and complete. 

— optimal iff it is locally finite, non-redundant and weakly complete. 

— minimal iff for all $C$, $\rho(C)$ contains only downward covers¹ and all its 
elements are incomparable ($D_1, D_2 \in \rho(C) \Rightarrow D_1 \not\succeq D_2$ and $D_2 \not\succeq D_1$). 

— (downward) cover set iff $\rho(C)$ is a maximal set of non-equivalent downward 
covers of $C$. 

— perfect iff it is minimal and optimal. 

Theorem 5. 0/. For a language containing at least one predicate symbol of 
arity ≥ 2, there exist no ideal (downward) refinement operators. 

The nonexistence of ideal refinement operators is due to the incompleteness 
of the (unique) cover set of a clause $C$, because of uncovered infinite ascending 
chains $C \succ \ldots \succ E_{i+1} \succ E_i \succ E_{i-1} \succ \ldots \succ E_1$ (for which there exists no 
maximal element $E \succeq E_i$ for all $i$, such that $C \succ E$). Indeed, since none of the 
$E_i$ can be a downward cover of $C$, $C$ cannot have a complete downward cover 
set. 

Every ideal refinement operator $\rho$ determines a finite and complete downward 
cover set contained in $\rho(C)$, obtained from $\rho(C)$ by removing all $E$ covered by some 
$D \in \rho(C)$: $D \succ E$. 



3 Ideal Versus Optimal Refinement Operators 

The subsumption lattice of hypotheses is far from being tree-like: a given clause 
$D$ can be reachable from several incomparable hypotheses $C_1, C_2, \ldots$. 

Theorem 6. A refinement operator cannot be both complete (a feature of ideal 
operators) and non-redundant (a feature of optimal operators). 

Proposition 7. For each ideal refinement operator $\rho$ we can construct an 
optimal refinement operator $\rho^{(o)}$. 

$\rho^{(o)}$ is obtained from $\rho$ such that for $D \in \rho(C_1) \cap \ldots \cap \rho(C_n)$ there is 
exactly one $i$ with $D \in \rho^{(o)}(C_i)$. 

The efficiency of ideal and respectively optimal refinement operators depends 
on the density of solutions in the search space. Ideal operators are preferable for 
search spaces with dense solutions, for which almost any refinement path leads 
to a solution. In such cases, an optimal (non-redundant) operator might get quite 
close to a solution C but could backtrack just before finding it for reasons of non- 
redundancy (for example because C is scheduled to be visited on a different path 
and thus it avoids revisiting it). Despite this problem, the solutions are dense, 
so an optimal operator would not behave too badly, after all. 

On the other hand, optimal operators are preferable for search spaces with 
rare solutions, in which case a significant portion of the search space would be 
traversed and any redundancies in the search due to an ideal operator would be 
very time consuming. 
¹ $D$ is a downward cover of $C$ iff $C \succ D$ and no $E$ satisfies $C \succ E \succ D$. 
Thus, unless we are dealing with a hypotheses space with a very high solution 
density, we shall prefer an optimal operator over an ideal one. However, in 
practice we shall proceed by first constructing an ideal refinement operator $\rho$ 
and only subsequently modifying it, as in Proposition 7, to produce an optimal 
operator $\rho^{(o)}$. 

4 Refinement Operators for Hypotheses Spaces Bounded 
below by a MSC 

Limiting the hypotheses space below by a most specific (bottom) clause $\bot$ leads 
to a more efficient search.² This strategy has proven successful in state-of-the-art 
systems like Progol, which search the space of hypotheses $C$ between the 
most general clause (for example the empty clause $\Box$) and the most specific 
clause $\bot$: $\Box \succeq C \succeq \bot$ (for efficiency reasons, the generality ordering employed 
is subsumption rather than full logical implication). 

Formalizing Progol’s behavior amounts to considering hypotheses spaces 
consisting of clause-substitution pairs $C = (cl(C), \theta_\bot(C))$ such that $cl(C)\theta_\bot(C) \subseteq \bot$. 
(For simplicity, we shall identify in the following $cl(C)$ with $C$.³) 

The following refinement operator is a generalization of Laird’s operator in 
the case of hypotheses spaces bounded below by a MSC $\bot$. 

$D \in \rho_\bot^{(L)}(C)$ iff either 

(1) $D = C \cup \{L'\}$ with $L \in \bot$ ($L'$ denotes a literal with the same predicate 
symbol as $L$, but with new and distinct variables), or 

(2) $D = C\{X_j/X_i\}$ with $\{X_i/A, X_j/A\} \subseteq \theta_\bot(C)$. 

Note that in (2) we unify only variables $X_i, X_j$ corresponding to the same 
variable $A$ from $\bot$ (since otherwise we would obtain a clause more specific than 
$\bot$). 

$\rho_\bot^{(L)}$ is finite, complete, but improper. The lack of properness is due to the 
possibility of selecting a given literal $L \in \bot$ several times in the current 
hypothesis (using (1)). It can be easily shown that the nonexistence result (Theorem 5) for ideal 
refinement operators can be restated in the case of hypotheses spaces bounded 
below by a MSC. Therefore, we cannot hope to convert $\rho_\bot^{(L)}$ (which is improper) 
to an ideal operator. 

On the other hand, the Progol implementation uses a slightly weaker refinement 
operator that considers each literal $L \in \bot$ for selection only once. This 
weaker operator is no longer complete, at least not w.r.t. ordinary subsumption. 

For example, if $\bot = \ldots \leftarrow p(A, A)$, then the weaker refinement operator 
would construct only hypotheses with a single p-literal, such as $H_1 = \ldots \leftarrow p(X, X)$, 
or $H_2 = \ldots \leftarrow p(X, Y)$, but it will never consider hypotheses with multiple 
p-literals, like $H_3 = \ldots \leftarrow p(X, Y), p(Y, X)$, or $H_4 = \ldots \leftarrow p(X, Y), p(Y, Z), 
p(Z, W)$, etc., since such hypotheses could be constructed only if we allowed 
selecting (suitably variabilized versions of) the literal $p(A, A)$ several times (and 
not just once, as in Progol). Note that $H_3$ is strictly more general (w.r.t. 
subsumption) than $H_1$, but also strictly more specific than $H_2$: $H_2 \succ H_3 \succ H_1$. 

² In the following, we restrict ourselves for simplicity to refinement operators for 
flattened definite Horn clauses. 

³ In general, for a given clause $C$ there can be several distinct substitutions $\theta_i$ such that 
$C\theta_i \subseteq \bot$. Viewing the various clause-substitution pairs $(C, \theta_i)$ as distinct hypotheses 
amounts to distinguishing the $\bot$-literals associated to each of the literals of $C$. 



Since $H_3$ is not even in the search space, Progol’s refinement operator is, in a 
way, incomplete.⁴ This incompleteness is due to the fact that $\bot$ is scanned only 
once for literal selection. It could be easily bridged by scanning $\bot$ repeatedly, 
so that a given literal can be selected several times. Unfortunately, in principle 
we cannot bound the number of traversals, although in practice we can set an 
upper bound. 

On the other hand, looking at the above-mentioned incompleteness more 
carefully, we are led to the idea that it is somehow justified by the examples and 
the MDL principle. 

In our previous example, if $p(A, A)$ is the only p-literal in $\bot$, then it may be 
that something like: p(a,a). p(b,b). p(c,c). [ex] are the only examples. In 
any case, we could not have had examples like p(a,b). p(b,a). which would 
have generated $\bot = \ldots \leftarrow p(A, B), p(B, A)$ instead of $\bot = \ldots \leftarrow p(A, A)$. Now, 
it seems reasonable to assume that a hypothesis like $H = \ldots \leftarrow p(X, Y), p(Y, X)$, 
although logically consistent with the examples [ex], is not “required” by them. 
So, although Progol generally returns the most general hypothesis consistent 
with the examples, when it has to choose between hypotheses with multiple 
occurrences of the same literal from $\bot$, it behaves as if it always preferred 
the more specific one (the one with just one occurrence). A justification of this 
behavior could be that the more general hypotheses are not “required” by the 
examples. Also, the more general hypotheses (with several occurrences of some 
literal from $\bot$) are always longer (while covering the same number of examples) 
and thus will be discarded by the MDL principle anyway. 

As we have already mentioned, the subtle properties of subsumption are due 
to the possibility of clauses with more literals being more general than clauses 
with fewer literals. This is only possible in the case of multiple occurrences of 
literals with the same predicate symbol (as for example in uncovered infinite 
ascending chains). 

In the following, we introduce a weaker form of subsumption, which exactly 
captures Progol’s behavior by disallowing substitutions that identify literals. 



Definition 8. Clause $C$ weakly-subsumes clause $D$ relative to $\bot$, $C \succeq_w D$, iff 
$C\theta \subseteq D$ for some substitution $\theta$ that does not identify literals (i.e. for which there 
are no literals $L_1, L_2 \in C$ such that $L_1\theta = L_2\theta$) and such that $\theta_\bot(D) \circ \theta = \theta_\bot(C)$. 

Note that although in the above example $H_3 \succ H_1$ w.r.t. (ordinary) subsumption, 
they become incomparable w.r.t. weak subsumption because the substitution 
$\theta = \{Y/X\}$ that ensures the subsumption relationship $H_3 \succ H_1$ identifies 
the literals $p(X, Y)$ and $p(Y, X)$. 
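
A minimal sketch of the weak subsumption test follows, assuming flattened clauses whose terms are all variables and hypotheses given as clause-substitution pairs. The function name and the dictionary encoding of the substitutions into the bottom clause are assumptions of this sketch. Since the clauses are sets of distinct literals, "not identifying literals" amounts to mapping distinct literals of C to distinct literals of D.

```python
def weakly_subsumes(c, d, bot_c, bot_d, theta=None, used=frozenset()):
    """Weak subsumption check for clause-substitution pairs (sketch).

    c, d:  lists of distinct literals (pred, tuple_of_variables).
    bot_c, bot_d: dicts mapping each variable of c (resp. d) to the
        bottom-clause variable it stands for.
    theta must map distinct literals of c to distinct literals of d and must
    respect the bottom-clause variables: bot_d[theta(X)] == bot_c[X].
    """
    theta = {} if theta is None else theta
    if not c:
        return True
    (pred, args), rest = c[0], c[1:]
    for i, (dpred, dargs) in enumerate(d):
        if i in used or dpred != pred or len(dargs) != len(args):
            continue
        extended = dict(theta)
        compatible = all(
            extended.setdefault(a, b) == b and bot_c[a] == bot_d[b]
            for a, b in zip(args, dargs)
        )
        if compatible and weakly_subsumes(rest, d, bot_c, bot_d, extended, used | {i}):
            return True
    return False


# The hypotheses of the running example, relative to a bottom clause p(A, A).
h1, bot_h1 = [("p", ("X", "X"))], {"X": "A"}
h2, bot_h2 = [("p", ("X", "Y"))], {"X": "A", "Y": "A"}
h3, bot_h3 = [("p", ("X", "Y")), ("p", ("Y", "X"))], {"X": "A", "Y": "A"}

h2_over_h1 = weakly_subsumes(h2, h1, bot_h2, bot_h1)
h3_over_h1 = weakly_subsumes(h3, h1, bot_h3, bot_h1)
```

Because distinct literals must have distinct images, a clause can weakly subsume only clauses that are at least as long, so the two-literal clause cannot weakly subsume the one-literal clause.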

⁴ Its search, however, is complete if we use the MDL heuristic. See below. 
Although Progol’s refinement operator is in a way incomplete w.r.t. ordinary 
subsumption, it is complete w.r.t. weak subsumption. Disallowing substitutions 
that identify literals entails the following properties of weak subsumption. 

Proposition 9. If $C \succeq_w D$ then $|C| \le |D|$ (where $|C|$ is the length of the clause 
$C$, i.e. the number of its literals). 

Lemma 10. (a) In the space of clauses ordered by weak subsumption there 
exist no infinite ascending chains (and therefore no uncovered infinite ascending 
chains). 

(b) There exist no uncovered infinite descending chains. 

Lemma 10(a) implies the existence of complete downward cover sets, which 
can play the role of ideal operators for weak subsumption. 

A form of subsumption even weaker than weak subsumption is subsumption 
under object identity: $C \succeq_{OI} D$ iff $C\theta \subseteq D$ for some substitution $\theta$ that 
does not unify variables of $C$. For example, $p(X, Y) \not\succeq_{OI} p(A, A)$, showing that 
subsumption under object identity is too weak for our purposes. 

A form of subsumption slightly stronger than weak subsumption (but still 
weaker than ordinary subsumption) is “non-decreasing” subsumption: $C \succeq_{nd} D$ 
iff $C\theta \subseteq D$ and $|C| \le |D|$. (Such a substitution $\theta$ can identify literals of $C$, but 
other literals have to be left out when going from $C$ to $D$ to ensure $|C| \le |D|$. This 
leads to somewhat cumbersome properties of “non-decreasing” subsumption.) 

Concluding, we have the following chain of increasingly stronger forms of 
subsumption: $C \succeq_{OI} D \Rightarrow C \succeq_w D \Rightarrow C \succeq_{nd} D \Rightarrow C \succeq D$. 

We have seen that Laird’s operator $\rho_\bot^{(L)}$ is locally finite, complete, but 
improper and that it cannot be converted to an ideal operator w.r.t. subsumption. 
However, it can be converted to an ideal operator w.r.t. weak subsumption by 
selecting each literal $L \in \bot$ at most once: 

$D \in \rho_\bot^{(i)}(C)$ iff either 

(1) $D = C \cup \{L'\}$ with $L \in \bot \setminus C\theta_\bot(C)$ ($L'$ being $L$ with new and distinct 
variables), or 

(2) $D = C\{X_j/X_i\}$ with $\{X_i/A, X_j/A\} \subseteq \theta_\bot(C)$. 

Since literals from $\bot$ are selected only once, $\rho_\bot^{(i)}$ turns out proper, and 
although it loses completeness w.r.t. ordinary subsumption, it is still complete 
w.r.t. weak subsumption. 
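
The two refinement steps above can be sketched as a generator over clause-substitution pairs. This is an illustrative reading of the definition, not the authors' implementation: the function names, the set-of-literals encoding, and the fresh-variable generator are assumptions, clause heads are ignored, and variants that differ only by variable renaming are not merged.

```python
import itertools


def refine(clause, theta_bot, bottom, fresh):
    """Yield the refinements of a clause-substitution pair (sketch).

    clause:    frozenset of literals (pred, tuple_of_variables)
    theta_bot: dict mapping each clause variable to its bottom-clause variable
    bottom:    list of bottom-clause literals (pred, tuple_of_variables)
    fresh:     iterator producing new variable names
    """
    covered = {(p, tuple(theta_bot[v] for v in args)) for p, args in clause}
    # Step (1): add a bottom literal not selected yet, with new distinct variables.
    for p, args in bottom:
        if (p, args) in covered:
            continue
        new_vars = tuple(next(fresh) for _ in args)
        yield clause | {(p, new_vars)}, {**theta_bot, **dict(zip(new_vars, args))}
    # Step (2): unify two variables standing for the same bottom-clause variable.
    for xi, xj in itertools.combinations(sorted(theta_bot), 2):
        if theta_bot[xi] == theta_bot[xj]:
            new_clause = frozenset(
                (p, tuple(xi if v == xj else v for v in args)) for p, args in clause
            )
            yield new_clause, {v: a for v, a in theta_bot.items() if v != xj}


def search_space(bottom):
    """Collect every clause reachable from the empty clause."""
    fresh = (f"X{i}" for i in itertools.count())
    seen, frontier = [], [(frozenset(), {})]
    while frontier:
        clause, theta_bot = frontier.pop()
        if clause in seen:
            continue
        seen.append(clause)
        frontier.extend(refine(clause, theta_bot, bottom, fresh))
    return seen


# Bottom clause body p(A, A), as in the running example.
space = search_space([("p", ("A", "A"))])
```

For this bottom clause the reachable hypotheses are the empty body, p(X0, X1) and p(X0, X0); no hypothesis with two p-literals is ever generated, in line with the discussion of the search space above.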

4.1 From Ideal to Optimal Refinement Operators 

We have already seen (Theorem 6) that, due to completeness, ideal refinement 
operators cannot be non-redundant and therefore optimal. As already argued 
in Section 3, non-redundancy is extremely important for efficiency. We shall 
therefore transform the ideal refinement operator $\rho_\bot^{(i)}$ to an optimal operator 
$\rho_\bot^{(o)}$ by replacing the stronger requirement of completeness with the weaker 
one of weak completeness. Non-redundancy is achieved (as in Proposition 7) 
by assigning a successor $D \in \rho_\bot^{(i)}(C_1) \cap \ldots \cap \rho_\bot^{(i)}(C_n)$ to one and only one of 
its predecessors $C_i$: $D \in \rho_\bot^{(o)}(C_i)$ and $\forall j \ne i.\ D \notin \rho_\bot^{(o)}(C_j)$. The refinement 
graph of such a non-redundant operator becomes tree-like. If the operator is 
additionally weakly complete, then every element in the search space can be 
reached through exactly one refinement chain. 

The essential cause for the redundancy of a refinement operator (like $\rho_\bot^{(i)}$) is 
the commutativity of the operations of the operator (such as literal addition (1) 
and elementary substitution (2) in the case of $\rho_\bot^{(i)}$). For example, $D' \cup \{L_1, L_2\}$ 
can be reached both from $D' \cup \{L_2\}$ by adding $L_1$ and from $D' \cup \{L_1\}$ by adding 
$L_2$. This redundancy is due to the commutativity of the operations of adding 
literal $L_1$ and literal $L_2$ respectively. A similar phenomenon turns up in the case 
of substitutions. 

The assignment of $D$ to one of its predecessors $C_i$ is largely arbitrary, but has 
to be done for ensuring non-redundancy. This can be achieved by imposing an 
ordering on the literals in $\bot$ and making the selection decisions for the literals 
$L_i \in \bot$ in the given order. We also impose an ordering on the variable occurrences 
in $\bot$ and make the unification decisions for these variable occurrences in the given 
order. Finally, we have to make sure that literal additions (1) do not commute 
with elementary substitutions (2). This is achieved by allowing only substitutions 
involving newly introduced (“fresh”) variables (the substitutions involving “old” 
variables having been performed already). 

Optimal refinement operators have been introduced for the system 
Claudien. However, the refinement operator of Claudien is optimal only w.r.t. 
literal selection (which makes the problem a lot easier since variabilizations are 
not considered). One could simulate variabilizations by using explicit equality 
literals in the Dlab templates, but the resulting algorithm is no longer optimal 
since the transitivity of equality is not taken into account. For example, in case 
of a template containing $\ldots \leftarrow [X = Y, X = Z, Y = Z]$, the algorithm would 
generate the following equivalent clauses: $\ldots \leftarrow X = Y, X = Z$ and $\ldots \leftarrow X = 
Y, Y = Z$. 

In the following, we construct an optimal operator $\rho_\bot^{(o)}$ (w.r.t. weak
subsumption) associated to the ideal operator $\rho_\bot^{(i)}$. We start by assuming an
ordering of the literals $L_k \in \bot$: $L_k$ precedes $L_l$ in this ordering iff $k < l$. This
ordering will be used to order the selection decisions for the literals of $\bot$: we will
not consider selecting a literal $L_k$ if a decision for $L_l$, $l > k$ has already been
made (we shall call this rule the 'literal rule').
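The effect of the literal rule can be illustrated in isolation. The following is a minimal sketch, assuming a purely propositional setting in which a clause is just the set of indices of the selected literals of $\bot$ (variabilizations are ignored). The function names `refine_by_literal_rule` and `enumerate_clauses` are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Clause = Tuple[int, ...]  # increasing indices into the ordered bottom clause


def refine_by_literal_rule(clause: Clause, n_literals: int) -> List[Clause]:
    """Children of `clause` under the literal rule.

    Only literals with an index larger than every index already selected
    may be added, so every clause has exactly one parent.
    """
    start = clause[-1] + 1 if clause else 0
    return [clause + (k,) for k in range(start, n_literals)]


def enumerate_clauses(n_literals: int) -> List[Clause]:
    """Walk the refinement tree depth first, starting from the empty clause."""
    visited: List[Clause] = []
    stack: List[Clause] = [()]
    while stack:
        clause = stack.pop()
        visited.append(clause)
        stack.extend(refine_by_literal_rule(clause, n_literals))
    return visited
```

With three literals, the walk visits each of the eight subsets once, which is the tree-like refinement graph the text asks for. Without the rule, the subset {L_1, L_2} would be generated from both {L_1} and {L_2}.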

The ordering of the selection decisions for literals also induces an ordering
on the variable occurrences in $\bot$: $X_i$ precedes $X_j$ in this ordering ($i < j$) iff $X_i$
is a variable occurrence in a literal selected before the literal containing $X_j$, or
$X_i$ precedes $X_j$ in the same literal.



® To each argument of each literal of T we assign a new and distinct variable Xi 
(denoting a variable occurrence). 
To achieve non-redundancy, we will impose an order in which the substitutions
are to be performed. Namely, we shall disallow substitutions $\{X_j/X_i\}$, $i < j$
if some $X_k$ with $k > j$ had already been involved in a previous substitution
$\{X_k/X_l\}$ (we shall refer to this rule as the 'substitution rule'). Roughly
speaking, we make substitutions in increasing order of the variables involved. This
ensures the non-redundancy of the operation of making substitutions.

Having operators for selecting literals and making substitutions that are
non-redundant if taken separately does not ensure a non-redundant refinement
operator when the two operators are combined: we also have to make sure that literal
additions (1) and elementary substitutions (2) do not commute. For example,
if $\bot = \ldots \leftarrow p(A,A), q(B)$, the clause $\ldots \leftarrow p(X,X), q(Z)$ could be obtained
either by adding the literal $q(Z)$ to $C_1 = \ldots \leftarrow p(X,X)$ or by making the
substitution $\{Y/X\}$ in $C_2 = \ldots \leftarrow p(X,Y), q(Z)$. The latter operation should be
disallowed since it involves no "fresh" variables (assuming that $q(Z)$ was the last
literal added, $Z$ is the only "fresh" variable).

In general, we shall allow only substitutions involving at least one "fresh"
variable (we shall call this rule the 'fresh variables rule'). The set of "fresh" variables
is initialized when adding a new literal $L$ with the variables of $L$. Variables
involved in subsequent substitutions are removed from the list of "fresh" variables.
Substitutions involving no fresh variables are assumed to have already been tried
on the ancestors of the current clause and are therefore disallowed.

The three rules above (the literal, substitution and fresh variables rules) are
sufficient to turn $\rho_\bot^{(i)}$ into an optimal operator w.r.t. weak subsumption.
However, the substitution rule forces us to explicitly work with variable occurrences,
instead of just working with the substitutions $\theta_\bot(C)$ (if $X_i$ and $X_k$ are variable
occurrences, then the substitution $\{X_k/X_i\}$ would eliminate $X_k$ when working
just with substitutions, and later on we wouldn't be able to forbid a
substitution $\{X_j/X_i\}$ for $j < k$ because we would no longer have $X_k$).

Fortunately, it is possible to find a more elegant formulation of the substitution
and fresh variable rules combined. For each clause $C$, we shall keep a list
of fresh substitutions $\mathcal{F}(C)$ that is initialized with $\theta_\bot(L)$ when adding a new
literal $L$. As before, we shall allow only substitutions $\{X_j/X_i\}$ involving a fresh
variable $X_j$. But we eliminate from $\mathcal{F}(C)$ not just $X_j/A$, but also all $X_k/B$ with
$k < j$ ('prefix deletion rule'). This ensures that after performing a substitution
of $X_j$, no substitution involving a "smaller" $X_k$ will ever be attempted. This
is essentially the substitution rule in disguise, only that now we can deal with
substitutions instead of a more cumbersome representation that uses variable
occurrences.

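The optimal refinement operator $\rho_\bot^{(o)}$ can thus be defined as:

$D \in \rho_\bot^{(o)}(C)$ iff either

(1) $D = C \cup \{L'\}$ with $L \in \bot \setminus \mathrm{prefix}_\bot(C\theta_\bot(C))$, where
    $\mathrm{prefix}_\bot(C) = \{L_i \in \bot \mid \exists L_j \in C \text{ such that } i \leq j\}$ and $L'$ is $L$
    with new and distinct variables.
    $\theta_\bot(D) = \theta_\bot(C) \cup \theta_\bot(L')$, $\mathcal{F}(D) = \theta_\bot(L')$, or

(2) $D = C\{X_j/X_i\}$ with $i < j$, $\{X_i/A, X_j/A\} \subseteq \theta_\bot(C)$ and $X_j/A \in \mathcal{F}(C)$.
    $\theta_\bot(D) = \theta_\bot(C) \setminus \{X_j/A\}$, $\mathcal{F}(D) = \mathcal{F}(C) \setminus \{X_k/B \in \mathcal{F}(C) \mid k \leq j\}$.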
Example 11. Let $\bot = p(A,A,A,A)$ and $C_1 = p(X_1,X_2,X_3,X_4)$, $\theta_\bot(C_1) =
\{X_1/A, X_2/A, X_3/A, X_4/A\}$, $\mathcal{F}(C_1) = \{X_1/A, X_2/A, X_3/A, X_4/A\}$. The
substitution $\{X_3/X_1\}$ eliminates the entire prefix $\{X_1/A, X_2/A, X_3/A\}$ from $\mathcal{F}(C_1)$:
$C_2 = C_1\{X_3/X_1\} = p(X_1,X_2,X_1,X_4)$, $\theta_\bot(C_2) = \{X_1/A, X_2/A, X_4/A\}$,
$\mathcal{F}(C_2) = \{X_4/A\}$. Now, only the substitutions involving $X_4$ are allowed for $C_2$.
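The bookkeeping in Example 11 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: variable occurrences are represented by their indices, `theta` maps each occurrence to its bottom-clause variable, `fresh` lists the indices still in the fresh list, and the function name `substitute` is hypothetical.

```python
from typing import Dict, List, Tuple

Theta = Dict[int, str]  # occurrence index i -> bottom-clause variable of X_i


def substitute(theta: Theta, fresh: List[int], j: int, i: int) -> Tuple[Theta, List[int]]:
    """Apply {X_j/X_i} and return the updated (theta, fresh) pair.

    The substitution is allowed only if i < j, both occurrences map to the
    same bottom-clause variable, and X_j is still fresh.
    """
    if not i < j:
        raise ValueError("need i < j")
    if i not in theta or j not in theta or theta[i] != theta[j]:
        raise ValueError("X_i and X_j must map to the same bottom-clause variable")
    if j not in fresh:
        raise ValueError("X_j is not fresh")
    new_theta = {k: v for k, v in theta.items() if k != j}
    # Prefix deletion: drop X_j and every smaller occurrence from the fresh list.
    new_fresh = [k for k in fresh if k > j]
    return new_theta, new_fresh
```

Starting from the four fresh occurrences of Example 11, `substitute(theta, [1, 2, 3, 4], j=3, i=1)` leaves only occurrence 4 fresh, so a later attempt at `{X_2/X_1}` is rejected.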

5 Refinement Operators for Clauses with Variable 
Dependencies 

Most implemented ILP systems restrict the search space by providing mode
declarations that impose constraints on the types of the variables. Also, variables
are declared as either input (+) or output (−). In this setting, valid clauses
(clauses verifying the mode declarations) are ordered sets of literals such that
every input variable $+X$ in some literal $L_i$ is preceded by a literal $L_j$ containing
a corresponding output variable $-X$. These variable dependencies induced by
the ±mode declarations affect a refinement operator for an unrestricted language
(like $\rho_\bot^{(o)}$) since now not all refinements according to the unrestricted $\rho_\bot^{(o)}$ are
valid clauses. Nevertheless, we can use $\rho_\bot^{(o)}$ to construct an optimal refinement
operator $\rho_\bot^{(o)(\pm)}$ for clauses with ±variable dependencies by repeatedly using the
"unrestricted" $\rho_\bot^{(o)}$ to generate refinements until a valid refinement is obtained.
One-step refinements w.r.t. $\rho_\bot^{(o)(\pm)}$ can thus be multi-step refinements w.r.t.
$\rho_\bot^{(o)}$ (all intermediate steps being necessarily invalid).

$$\rho_\bot^{(o)(\pm)}(C) = \{D_n \mid D_1 \in \rho_\bot^{(o)}(C), D_2 \in \rho_\bot^{(o)}(D_1), \ldots, D_n \in \rho_\bot^{(o)}(D_{n-1})
\text{ such that } D_n \text{ is valid but } D_1, \ldots, D_{n-1} \text{ are invalid}\}.$$

(Here, a clause C is called valid iff it admits an ordering that satisfies the mode 
declarations. Such an ordering can easily be obtained with a kind of topological 
sort.) 
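The validity test mentioned above can be sketched as a greedy topological sort. The representation is an assumption made for illustration: each body literal carries a set of +variables and a set of −variables, the head literal is ignored (as in the text), and the names `Literal` and `valid_ordering` are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Set


class Literal(NamedTuple):
    name: str
    inputs: FrozenSet[str]   # +variables
    outputs: FrozenSet[str]  # -variables


def valid_ordering(clause: List[Literal]) -> Optional[List[Literal]]:
    """Return an ordering in which every +variable is preceded by a literal
    that produces it as a -variable, or None if the clause is invalid."""
    remaining = list(clause)
    available: Set[str] = set()
    ordered: List[Literal] = []
    while remaining:
        # Any literal whose inputs are all available can safely go next,
        # because adding a literal never makes a variable unavailable.
        ready = next((lit for lit in remaining if lit.inputs <= available), None)
        if ready is None:
            return None
        remaining.remove(ready)
        ordered.append(ready)
        available |= ready.outputs
    return ordered
```

For the body `p(+X), q(-X)` the function returns the ordering `q, p`, while the single-literal body `p(+X)` has no valid ordering.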

Note that $\rho_\bot^{(o)(\pm)}$ is optimal, but non-minimal, since it may add more than
one literal in one refinement step. For example, if $\bot = \ldots \leftarrow p(+A), q(-A)$,
it would generate both $C_1 = \ldots \leftarrow q(-X), p(+X)$ and $C_2 = \ldots \leftarrow q(-X)$ as
one-step refinements of $\Box$, whereas it may seem more natural to view $C_1$ as a
one-step refinement of $C_2$ (rather than of $\Box$).

To obtain a minimal optimal (i.e. a perfect) refinement operator, we need
to postpone the choice of unselectable (variabilizations of) literals until they
become selectable, instead of making a selection decision right away. We also
have to make sure that selectable literals obtain all their +variables from previous
−variables by making all the corresponding substitutions in one refinement step.

(1) add a new literal (to the right of all selected ones) and link all its +variables
    according to "old" substitutions

(2) make a substitution (involving at least a "fresh" variable)

(3) wake up a previously unselectable literal (to the left of the rightmost selected
    literal).

® This is true for literals in the body of the clause. Input variables in the head of
the clause behave exactly like output variables of body literals. To simplify the
presentation, we shall not explicitly refer to the head literal.

If several literals can be added at a given time, we shall add them in the
order induced by $\bot$. In other words, we shall disallow adding a literal $L_2 < L_1$
after adding $L_1$ if $L_2$ was selectable even before adding $L_1$ (since otherwise we
would redundantly generate $C, L_1, L_2$ both from $C, L_1$ by adding $L_2$ and from
$C, L_2$ by adding $L_1$).

On the other hand, the multiple +variables of several literals can be linked
to the same −variable ("source"). For example, if $\bot = \ldots \leftarrow p_1(+A), p_2(+A),
q(-A)$, then $\rho(\Box) = \{C_1\}$, $\rho(C_1) = \{C_2, C_3\}$, $\rho(C_2) = \{C_4\}$ and $\rho(C_3) =
\rho(C_4) = \emptyset$, where $C_1 = \ldots \leftarrow q(-X)$, $C_2 = \ldots \leftarrow q(-X), p_1(+X)$, $C_3 = \ldots \leftarrow
q(-X), p_2(+X)$, $C_4 = \ldots \leftarrow q(-X), p_1(+X), p_2(+X)$. Note that both $p_1(+X)$
and $p_2(+X)$ use the same "source" $q(-X)$. Also note that adding $p_2(+X)$ to
$C_2$ is allowed since $p_2 > p_1$, while adding $p_1(+X)$ to $C_3$ is disallowed because
$p_1 < p_2$ was selectable even before adding $p_2$ (to $C_1$).

Using the notation $\theta_\bot^+(L')$ (and respectively $\theta_\bot^-(L')$) for the substitutions of
+(−) variables of $\theta_\bot(L')$, we obtain the following perfect (minimal and optimal)
refinement operator for clauses with variable dependencies.

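$D \in \rho_\bot^{(o)(\pm)}(C)$ iff either

(1) $D = C \cup \{L'\}$ with $L \in \bot \setminus \mathrm{prefix}_\bot(C\theta_\bot(C))$, where $L'$ is $L$ with new and
    distinct −variables, and +variables such that $\theta_\bot^+(L') \subseteq \theta_\bot(D) = \theta_\bot(C) \cup
    \theta_\bot^-(L')$.
    $\mathcal{F}(D) = \theta_\bot^-(L') \setminus \{X_k/B \in \theta_\bot^-(L') \mid k < j \text{ for some } X_j/A \in \theta_\bot^+(L') \cup \theta_\bot^-(L')\}$,
    or

(2) $D = C\{X_j/X_i\}$ with $i < j$, $\{X_i/A, X_j/A\} \subseteq \theta_\bot(C)$ and $X_j/A \in \mathcal{F}(C)$.
    $\theta_\bot(D) = \theta_\bot(C) \setminus \{X_j/A\}$, $\mathcal{F}(D) = \mathcal{F}(C) \setminus \{X_k/B \in \mathcal{F}(C) \mid k \leq j\}$, or

(3) $D = C \cup \{L'\}$ with $L \in \mathrm{prefix}_\bot(C\theta_\bot(C)) \setminus C\theta_\bot(C)$, where $L'$ is $L$ with
    new and distinct −variables, and +variables such that $\theta_\bot^+(L') \subseteq \theta_\bot(D) =
    \theta_\bot(C) \cup \theta_\bot^-(L')$, and for all $L_i \in C$ such that $L_i > L'$, $\mathrm{first}_{L_i}(C) \cup \{L'\}$ is
    invalid, where $\mathrm{first}_{L_i}(C)$ are the literals added to $C$ before $L_i$.
    $\mathcal{F}(D) = \theta_\bot^-(L') \setminus \{X_k/B \in \theta_\bot^-(L') \mid k < j \text{ for some } X_j/A \in \theta_\bot^+(L') \cup \theta_\bot^-(L')\}$.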

Observe that substitutions (2) involving −variables (controlled by $\mathcal{F}$) are to
be performed right away when $\mathcal{F}$ allows it, because later additions of woken-up
literals will reset $\mathcal{F}$ and make those substitutions impossible. On the other
hand, we can always wake up literals (by solving their +variables) (3) after
making those substitutions. In other words, we firmly establish our "sources"
(i.e. −variables) before waking up the "consumers" (i.e. +variables).

^ The refinement operator of Markus is not minimal (no description of this operator is
available in [?], but see his footnote 4). As far as we know, our refinement operator
is the first minimal and optimal one (w.r.t. weak subsumption).

® The −variables of L' are preceded by all −variables of C in our variable ordering.
(This variable ordering is dynamical since it is induced by the selection decisions for
literals.)

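The following example illustrates the functioning of $\rho_\bot^{(o)(\pm)}$.

Example 12. For $\bot = \ldots \leftarrow r(+B), q_1(+A,-B), q_2(+A,-B), p(-A)$, $\rho(\Box) =
\{C_1\}$, $\rho(C_1) = \{C_2, C_3\}$, $\rho(C_2) = \{C_4, C_5\}$, $\rho(C_3) = \{C_6\}$, $\rho(C_4) = \{C_7\}$,
$\rho(C_5) = \{C_8, C_9\}$, $\rho(C_7) = \{C_{10}\}$, $\rho(C_6) = \rho(C_8) = \rho(C_9) = \rho(C_{10}) = \emptyset$, where

    $C_1 = \ldots \leftarrow p(-X)$
    $C_2 = \ldots \leftarrow p(-X), q_1(+X,-Y)$
    $C_3 = \ldots \leftarrow p(-X), q_2(+X,-Y)$
    $C_4 = \ldots \leftarrow p(-X), q_1(+X,-Y), r(+Y)$
    $C_5 = \ldots \leftarrow p(-X), q_1(+X,-Y), q_2(+X,-Y')$
    $C_6 = \ldots \leftarrow p(-X), q_2(+X,-Y), r(+Y)$
    $C_7 = \ldots \leftarrow p(-X), q_1(+X,-Y), r(+Y), q_2(+X,-Y')$
    $C_8 = \ldots \leftarrow p(-X), q_1(+X,-Y), q_2(+X,-Y)$
    $C_9 = \ldots \leftarrow p(-X), q_1(+X,-Y), q_2(+X,-Y'), r(+Y')$
    $C_{10} = \ldots \leftarrow p(-X), q_1(+X,-Y), q_2(+X,-Y), r(+Y)$.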



Note that $C_{10} \notin \rho(C_8)$ because $q_2 > r$ in $\bot$ and $\mathrm{first}_{q_2}(C_8) \cup \{r(+Y)\} =
C_2 \cup \{r(+Y)\} = C_4$ is valid, thereby violating the condition of step (3). In other
words, we cannot wake up $r(+Y)$ in $C_8$ because it was already selectable in $C_2$
(otherwise we would obtain the redundancy $C_{10} \in \rho(C_7) \cap \rho(C_8)$). For similar
reasons, we don't have $C_7 \in \rho(C_5)$, $C_5 \in \rho(C_3)$ or $C_9 \in \rho(C_6)$.

On the other hand, $C_{10} \notin \rho(C_9)$ because the substitution $\{Y'/Y\}$ in $C_9$ is
disallowed by $\mathcal{F}(C_9) = \emptyset$ (the last literal added, $r(+Y')$, having no −variables).

Note that the incompleteness of Progol's refinement operator (which applies
only steps (1) and (2)) is due to obtaining the substitutions of +variables only
from −variables of already selected literals, whereas they could be obtained
from −variables of literals that will be selected in the future (as in step (3)).
For example, if $\bot = \ldots \leftarrow p(-A), q(+A), r(-A)$, then Progol's refinement of
$C = \ldots \leftarrow p(-X)$ will miss the clause $D = \ldots \leftarrow p(-X), r(-Y), q(+Y)$ in
which $q$ obtains its $+Y$ from $r$.

We have implemented the refinement operators described in this paper and 
plan to use them as a component in a full ILP system. 
Abstract. Divide-and-Conquer (DAC) and Separate-and-Conquer
(SAC) are two strategies for rule induction that have been used extensively.
When searching for rules DAC is maximally conservative w.r.t.
decisions made during search for previous rules. This results in a very
efficient strategy, which however suffers from difficulties in effectively
inducing disjunctive concepts due to the replication problem. SAC on
the other hand is maximally liberal in the same respect. This allows for
a larger hypothesis space to be searched, which in many cases avoids
the replication problem but at the cost of lower efficiency. We present
a hybrid strategy called Reconsider-and-Conquer (RAC), which handles
the replication problem more effectively than DAC by reconsidering
some of the earlier decisions and allows for more efficient induction than
SAC by holding on to some of the decisions. We present experimental
results from propositional, numerical and relational domains demonstrating
that RAC significantly reduces the replication problem from which
DAC suffers and is several times (up to an order of magnitude) faster
than SAC.



1 Introduction 

The two main strategies for rule induction are Separate-and-Conquer and
Divide-and-Conquer. Separate-and-Conquer, often also referred to as Covering,
produces a set of rules by repeatedly specialising an overly general rule. At each
iteration a specialised rule is selected that covers a subset of the positive examples and
excludes the negative examples. This is repeated until all positive examples are
covered by the set of rules. The reader is referred to [?] for an excellent overview
of Separate-and-Conquer rule learning algorithms. Divide-and-Conquer produces
a hypothesis by splitting an overly general rule into a set of specialised rules
that cover disjoint subsets of the examples. The rules that cover positive examples
only are kept, while the rules that cover both positive and negative examples
are handled recursively in the same manner as the first overly general rule.

For Separate-and-Conquer, the computational cost (measured as the number
of checks to see whether or not a rule covers an example) grows quadratically in
the size of the example set, while it grows linearly using Divide-and-Conquer.¹
This follows from the fact that Separate-and-Conquer searches a larger hypothesis
space than Divide-and-Conquer [?]. For any hypothesis in this larger space
there is a corresponding hypothesis with identical coverage in the narrower space.
Hence none of the strategies is superior to the other in terms of expressiveness.
However, for many of the hypotheses within the narrower space there is a
hypothesis with identical coverage but fewer rules within the larger space. Since
the number of rules in a hypothesis provides a bound on the minimal number
of examples needed to find it, this means that Separate-and-Conquer often
requires fewer examples than Divide-and-Conquer to find a correct hypothesis.
In particular this is true in domains in which the replication problem [?]
is frequent, i.e. when the most compact definition of the target concept consists
of disjuncts whose truth values are (partially or totally) independent, e.g.
$p(x_1,x_2,x_3,x_4) \leftarrow (x_1 = 1 \wedge x_2 = 2) \vee (x_3 = 0 \wedge x_4 = 1)$.

The two strategies can be viewed as extremes w.r.t. how conservative they
are regarding earlier decisions when searching for additional rules.
Divide-and-Conquer can be regarded as maximally conservative and Separate-and-Conquer
as maximally liberal. We propose a hybrid strategy, called Reconsider-and-Conquer,
which combines the advantages of Divide-and-Conquer and Separate-and-Conquer
and reduces their weaknesses. The hybrid strategy allows for more
effective handling of the replication problem than Divide-and-Conquer by
reconsidering some decisions made in the search for previous rules. At the same time
it allows for more efficient induction than Separate-and-Conquer by holding on
to some of the earlier decisions.

In the next section we introduce some basic terminology. In section three,
we formally describe Reconsider-and-Conquer, and in section four we present
experimental results from propositional, numerical and relational domains
demonstrating that the hybrid approach may significantly reduce the replication
problem from which Divide-and-Conquer suffers while at the same time being
several times faster than Separate-and-Conquer. In section five, we discuss related
work, and in section six we give some concluding remarks and point out
directions for future research.

2 Preliminaries 

The reader is assumed to be familiar with the logic programming terminology [?].

A rule $r$ is a definite clause $r = h \leftarrow b_1 \wedge \ldots \wedge b_n$, where $h$ is the head, and
$b_1 \wedge \ldots \wedge b_n$ is a (possibly empty) body.

A rule $g$ is more general than a rule $s$ w.r.t. a set of rules $B$ (background
predicates), denoted $g \geq_B s$, iff $M_{\{g\} \cup B} \supseteq M_{\{s\} \cup B}$, where $M_P$ denotes the least
Herbrand model of $P$ (the set of all ground facts that follow from $P$).

The coverage of a set of rules $H$, w.r.t. background predicates $B$ and a set
of atoms $A$, is a set $A_{H_B} = \{a : a \in A \cap M_{H \cup B}\}$. We leave out the subscript
$B$ when it is clear from the context. Furthermore, if $H$ is a singleton $H = \{r\}$,
then $A_{\{r\}}$ is abbreviated to $A_r$.

¹ Assuming that the maximum number of ways to specialise a rule is fixed.

Given a rule $r$ and background predicates $B$, a specialisation operator $\sigma$
computes a set of rules, denoted $\sigma_B(r)$, such that for all $r' \in \sigma_B(r)$, $r \geq_B r'$.
Again we leave out the subscript if $B$ is clear from the context.

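Given a rule $r$, background predicates $B$, and a specialisation operator $\sigma$, a
split of $r$ is a set of rules $s = \{r_1, \ldots, r_n\}$, such that $r_i \in \sigma(r)$, for all $1 \leq i \leq n$,
and $M_{\{r_1,\ldots,r_n\} \cup B} = M_{\{r\} \cup B}$. Furthermore, $s$ is said to be a non-overlapping
split if $M_{\{r_i\} \cup B} \cap M_{\{r_j\} \cup B} = M_B$ for all $i, j = 1, \ldots, n$ such that $i \neq j$.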



3 Reconsider-and-Conquer 

Reconsider-and-Conquer works like Separate-and-Conquer in that rules are
iteratively added to the hypothesis while removing covered examples from the set
of positive examples. However, in contrast to Separate-and-Conquer, which adds
a single rule on each iteration, Reconsider-and-Conquer adds a set of rules. The
first rule that is included in this set is generated in exactly the same way as
is done by Separate-and-Conquer, i.e. by following a branch of specialisation
steps from an initial rule into a rule that covers positive examples only.
However, instead of continuing the search for a subsequent rule from the initial rule,
Reconsider-and-Conquer backs up one step to see whether some other specialisation
step could be taken in order to cover some of the remaining positive examples
(i.e. to complete another branch). This continues until eventually
Reconsider-and-Conquer has backed up to the initial rule. The way in which this set of
rules that is added to the hypothesis is generated is similar to how
Divide-and-Conquer works, but with one important difference: Reconsider-and-Conquer is
less restricted than Divide-and-Conquer regarding what possible specialisation
steps can be taken after having backed up. Divide-and-Conquer cannot choose its
specialisation steps independently, because the resulting rules
should constitute a (possibly non-overlapping) split.

One condition used by Reconsider-and-Conquer to decide whether or not to
continue from some rule on a completed branch is that the fraction of positive
examples among the covered examples must never decrease anywhere along the
branch (as this would indicate that the branch is focusing on covering negative
rather than positive examples). However, in principle both weaker and stronger
conditions could be employed. The weakest possible condition would be that
each rule along the branch should cover at least one positive example. This
would make Reconsider-and-Conquer behave in a way very similar to
Divide-and-Conquer. A maximally strong condition would be to always require
Reconsider-and-Conquer to back up to the initial rule, making the behaviour identical to that
of Separate-and-Conquer.

In Figure 1, we give a formal description of the algorithm. The branch of
specialisation steps currently explored is represented by a stack of rules together
with the positive and negative examples that they cover. Once a branch is
completed, i.e. a rule that covers only positive examples is added to the hypothesis,
the stack is updated by removing all covered examples from the stack.
Furthermore, the stack is truncated by keeping only the bottom part where the fraction
of positive examples covered by each rule does not decrease compared to those
covered by the preceding rule.



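function Reconsider-and-Conquer($E^+$, $E^-$)
    $H := \emptyset$
    while $E^+ \neq \emptyset$ do
        $r$ := an initial rule such that $E^+_r \neq \emptyset$
        $H'$ := Find-Rules($r$, [], $E^+_r$, $E^-_r$)
        $E^+ := E^+ \setminus E^+_{H'}$
        $H := H \cup H'$
    return $H$

function Find-Rules($r$, $S$, $E^+$, $E^-$)
    if $E^- \neq \emptyset$ then
        $r'$ := a rule $\in \sigma(r)$
        $H$ := Find-Rules($r'$, $(r, E^+, E^-) \cdot S$, $E^+_{r'}$, $E^-_{r'}$)
    else $H := \{r\}$
    repeat
        Update $S$ w.r.t. $H$
        if $S \neq []$ then
            Pop $(r, E^\oplus, E^\ominus)$ from $S$
            if there is a rule $r' \in \sigma(r)$ such that
               $|E^\oplus_{r'}| / |E^\oplus_{r'} \cup E^\ominus_{r'}| \geq |E^\oplus| / |E^\oplus \cup E^\ominus|$ then
                $H'$ := Find-Rules($r'$, $(r, E^\oplus, E^\ominus) \cdot S$, $E^\oplus_{r'}$, $E^\ominus_{r'}$)
                $H := H \cup H'$
                $S := (r, E^\oplus, E^\ominus) \cdot S$
    until $S = []$
    return $H$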



Fig. 1. The Reconsider-and-Conquer algorithm. 
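The control flow of Figure 1 can be sketched in runnable form. What follows is an illustration under several assumptions, not the authors' implementation (which, according to the paper, was written in SICStus Prolog): examples are tuples of attribute values, a rule is a set of `(attribute index, value)` tests, specialisations add one test taken from a covered positive example, and the candidate with the highest fraction of positives is chosen in place of the paper's information gain and probability heuristics. All function names are hypothetical.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

Example = Tuple[int, ...]
Rule = FrozenSet[Tuple[int, int]]                # conjunction of (attribute, value) tests
Entry = Tuple[Rule, Set[Example], Set[Example]]  # rule with its covered E+ and E-


def covered(rule: Rule, examples: Set[Example]) -> Set[Example]:
    return {e for e in examples if all(e[a] == v for a, v in rule)}


def fraction(pos: Set[Example], neg: Set[Example]) -> float:
    return len(pos) / (len(pos) + len(neg))


def best_specialisation(rule: Rule, pos: Set[Example], neg: Set[Example],
                        at_least: float = 0.0) -> Optional[Rule]:
    """Pick r' in sigma(r) that covers a positive example and has the highest
    fraction of positives (an assumed stand-in for the paper's heuristics)."""
    tested = {a for a, _ in rule}
    candidates = {rule | {(a, e[a])} for e in pos for a in range(len(e)) if a not in tested}
    scored = []
    for c in candidates:
        p, n = covered(c, pos), covered(c, neg)
        if fraction(p, n) >= at_least:
            scored.append((fraction(p, n), len(p), sorted(c), c))
    return max(scored, key=lambda t: t[:3])[3] if scored else None


def update(stack: List[Entry], hypothesis: List[Rule]) -> List[Entry]:
    """Drop positives covered by the hypothesis, then keep only the bottom part
    of the stack along which the fraction of positives does not decrease."""
    cleaned = [(r, {e for e in p if not any(covered(h, {e}) for h in hypothesis)}, n)
               for r, p, n in stack]
    kept: List[Entry] = []
    for entry in reversed(cleaned):              # bottom of the stack first
        if kept and fraction(entry[1], entry[2]) < fraction(kept[-1][1], kept[-1][2]):
            break
        kept.append(entry)
    return kept[::-1]                            # top of the stack first again


def find_rules(rule: Rule, stack: List[Entry],
               pos: Set[Example], neg: Set[Example]) -> List[Rule]:
    if neg:
        r2 = best_specialisation(rule, pos, neg)
        if r2 is None:
            raise ValueError("positive and negative examples cannot be separated")
        hyp = find_rules(r2, [(rule, pos, neg)] + stack, covered(r2, pos), covered(r2, neg))
    else:
        hyp = [rule]
    while True:
        stack = update(stack, hyp)
        if not stack:
            return hyp
        (r, p, n), stack = stack[0], stack[1:]
        r2 = best_specialisation(r, p, n, at_least=fraction(p, n))
        if r2 is not None:
            hyp = hyp + find_rules(r2, [(r, p, n)] + stack, covered(r2, p), covered(r2, n))
            stack = [(r, p, n)] + stack


def reconsider_and_conquer(pos: Set[Example], neg: Set[Example]) -> List[Rule]:
    hyp: List[Rule] = []
    pos = set(pos)
    while pos:
        initial: Rule = frozenset()              # the most general rule
        found = find_rules(initial, [], set(pos), set(neg))
        pos -= {e for h in found for e in covered(h, pos)}
        hyp += found
    return hyp
```

Because the greedy choice here differs from the heuristics used in the paper, the rules found on a given data set need not match those of the worked example. The sketch only guarantees that every returned rule excludes all negative examples and that all positive examples end up covered, provided no instance is labelled both positive and negative.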



An Example. Assume that the target predicate is $p(x_1,x_2,x_3,x_4) \leftarrow (x_1 =
1 \wedge x_2 = 1) \vee (x_3 = 1 \wedge x_4 = 1) \vee (x_3 = 1 \wedge x_4 = 2)$, and that we are given
100 positive and 100 negative instances of the target predicate, i.e. $|E^+| = 100$
and $|E^-| = 100$. Assume further that our specialisation operator is defined as
$\sigma(h \leftarrow b_1 \wedge \ldots \wedge b_n) = \{h \leftarrow b_1 \wedge \ldots \wedge b_n \wedge x = c \mid x$ is a variable in $h$ and
$c \in \{1, \ldots, 4\}\}$. Now assuming that Reconsider-and-Conquer starts with the
initial rule $r_0 = p(x_1,x_2,x_3,x_4)$, Find-Rules recursively generates a sequence of
more specialised rules, say:

    $r_1 = p(x_1,x_2,x_3,x_4) \leftarrow x_3 = 1$                   $|E^+_{r_1}| = 50$   $|E^-_{r_1}| = 10$
    $r_2 = p(x_1,x_2,x_3,x_4) \leftarrow x_3 = 1 \wedge x_4 = 1$     $|E^+_{r_2}| = 25$   $|E^-_{r_2}| = 0$

where the last rule is included in the hypothesis.
Find-Rules then updates the stack by removing all covered (positive)
examples and keeping only the bottom part of the stack that corresponds to a
sequence of specialisation steps that fulfills the condition of a non-decreasing
fraction of covered positive examples. In this case, the bottom element $r_1$ is
kept as it covers 25 positive and 10 negative examples compared to 75 positive
and 100 negative examples that are covered by the initial rule. So in contrast
to Separate-and-Conquer, Reconsider-and-Conquer does not restart the search
from the initial rule but continues from rule $r_1$ and finds a specialisation that
does not decrease the fraction of positive examples, say:

    $r_3 = p(x_1,x_2,x_3,x_4) \leftarrow x_3 = 1 \wedge x_4 = 2$     $|E^+_{r_3}| = 25$   $|E^-_{r_3}| = 0$

After the stack is updated, no rules remain and hence Reconsider-and-Conquer
restarts the search from an initial rule, and may choose any specialisation
without being restricted by the earlier choices. This contrasts with
Divide-and-Conquer, which would have had to choose some of the other rules in the
(non-overlapping) split from which $r_1$ was taken (e.g. $\{p(x_1,x_2,x_3,x_4) \leftarrow x_3 =
2,\ p(x_1,x_2,x_3,x_4) \leftarrow x_3 = 3,\ p(x_1,x_2,x_3,x_4) \leftarrow x_3 = 4\}$). Assuming the same
initial rule ($r_0$) is chosen again, the sequence of rules produced by Find-Rules
may look like:



    $r_4 = p(x_1,x_2,x_3,x_4) \leftarrow x_1 = 1$                   $|E^+_{r_4}| = 50$   $|E^-_{r_4}| = 20$
    $r_5 = p(x_1,x_2,x_3,x_4) \leftarrow x_1 = 1 \wedge x_2 = 1$     $|E^+_{r_5}| = 50$   $|E^-_{r_5}| = 0$

The last rule is included in the hypothesis and now all positive examples are
covered so Reconsider-and-Conquer terminates. Hence, the resulting hypothesis
is:

    $r_2 = p(x_1,x_2,x_3,x_4) \leftarrow x_3 = 1 \wedge x_4 = 1$
    $r_3 = p(x_1,x_2,x_3,x_4) \leftarrow x_3 = 1 \wedge x_4 = 2$
    $r_5 = p(x_1,x_2,x_3,x_4) \leftarrow x_1 = 1 \wedge x_2 = 1$

It should be noted that Divide-and-Conquer has to induce seven rules to
obtain a hypothesis with identical coverage, as the disjunct that corresponds to
$r_5$ above has to be replicated in five rules:



^( 2 : 1 , 2 : 2 , 2 : 3 , 24 ) 

^(21,22,23,24) 

^(21,22,23,24) 

^(21,22,23,24) 

^(21,22,23,24) 

^(21,22,23,24) 

^(21,22,23,24) 



23 = 1 A 24 = 1 
23 = 1 A 24 = 2 

23 = 1A24=3A2i = 1A22 = 1 
23 = 1A24=4A2i = 1A22 = 1 
23 = 2 A 2i = 1 A 22 = I 
23 = 3 A 2i = 1 A 22 = I 
23 = 4 A 2i = 1 A 22 = 1 
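That the three-rule and the seven-rule hypotheses have identical coverage can be checked by brute force. The sketch below assumes that each of the four attributes ranges over {1, ..., 4}, as suggested by the specialisation operator of the example. The rule encoding and the names `RAC_RULES`, `DAC_RULES`, `coverage` and `max_overlap` are hypothetical.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Rule = Dict[int, int]  # attribute number (1-based) -> required value

RAC_RULES: List[Rule] = [{3: 1, 4: 1}, {3: 1, 4: 2}, {1: 1, 2: 1}]
DAC_RULES: List[Rule] = [
    {3: 1, 4: 1},
    {3: 1, 4: 2},
    {3: 1, 4: 3, 1: 1, 2: 1},
    {3: 1, 4: 4, 1: 1, 2: 1},
    {3: 2, 1: 1, 2: 1},
    {3: 3, 1: 1, 2: 1},
    {3: 4, 1: 1, 2: 1},
]

INSTANCES = list(product(range(1, 5), repeat=4))


def fires(rule: Rule, x: Tuple[int, ...]) -> bool:
    return all(x[i - 1] == v for i, v in rule.items())


def coverage(rules: List[Rule]) -> Set[Tuple[int, ...]]:
    """All instances covered by at least one rule."""
    return {x for x in INSTANCES if any(fires(r, x) for r in rules)}


def max_overlap(rules: List[Rule]) -> int:
    """Largest number of rules firing on a single instance."""
    return max(sum(fires(r, x) for r in rules) for x in INSTANCES)
```

Both hypotheses cover the same 46 of the 256 instances. The seven rules never overlap, as required of a Divide-and-Conquer split, whereas two of the three rules may fire on the same instance.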
4 Empirical Evaluation

We first describe how the experiments were performed and then present the 
experimental results. 



4.1 Experimental Setting 



Reconsider-and-Conquer (RAC) was compared to Divide-and-Conquer (DAC)
and Separate-and-Conquer (SAC) in several propositional, numerical and
relational domains. The domains were obtained from the UCI repository
except for two relational domains: one consists of four sets of examples regarding
structure-activity comparisons of drugs for the treatment of Alzheimer's disease,
and was obtained from Oxford University Computing Laboratory, and the other
domain is about learning the definite clause grammar (DCG) in [?, p 455].²
All three algorithms were tested both with information gain heuristics and
probability metrics based on the hypergeometric distribution, which for DAC
are those given in [?] and [?] respectively, while the information gain heuristic
modified for SAC (and RAC) is taken from [?, p 165], and the modified version
of the probability metric in [?] for SAC and RAC is:



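$$P(|E^+_r|, |E^-_r|, |E^+_{r'}|, |E^-_{r'}|) =
\frac{\binom{|E^+_r|}{|E^+_{r'}|} \binom{|E^-_r|}{|E^-_{r'}|}}{\binom{|E^+_r \cup E^-_r|}{|E^+_{r'} \cup E^-_{r'}|}}$$

where $r' \in \sigma(r)$. The specialisation $r'$ of $r$ with lowest probability is chosen from
$\sigma(r)$ given that

$$\frac{|E^+_{r'}|}{|E^+_{r'} \cup E^-_{r'}|} \geq \frac{|E^+_r|}{|E^+_r \cup E^-_r|}$$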
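A small numeric sketch of this selection criterion follows. It assumes the standard hypergeometric form for the probability metric (the product of two binomial coefficients over a third); the printed formula is only partly legible in this copy, so that form is an assumption, as are the function names `split_probability` and `admissible`. Arguments are the numbers of positive and negative examples covered by $r$ and by $r'$.

```python
from math import comb


def split_probability(pos_r: int, neg_r: int, pos_s: int, neg_s: int) -> float:
    """Hypergeometric probability of drawing pos_s positives and neg_s negatives
    when pos_s + neg_s examples are drawn from those covered by r (assumed form)."""
    return comb(pos_r, pos_s) * comb(neg_r, neg_s) / comb(pos_r + neg_r, pos_s + neg_s)


def admissible(pos_r: int, neg_r: int, pos_s: int, neg_s: int) -> bool:
    """True if the fraction of positives covered by r' is at least that of r.
    The inequality is cross-multiplied to avoid dividing by zero."""
    return pos_s * (pos_r + neg_r) >= pos_r * (pos_s + neg_s)
```

For a rule covering 100 positive and 100 negative examples, a specialisation covering 50 positives and 10 negatives is admissible and has a lower probability than an uninformative one covering 30 of each, so it would be preferred under this criterion.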

Following [?], cut-points for continuous-valued attributes were chosen
dynamically from the boundary points between the positive and negative examples
in the training sets for the numerical and relational domains.

An experiment was performed in each domain, in which the entire example
set was randomly split into two partitions corresponding to 90% and 10% of
the examples respectively. The larger set was used for training and the smaller
for testing. The same training and test sets were used for all algorithms. Each
experiment was iterated 30 times and the mean accuracy on the test examples as
well as the amount of work performed measured as the cpu time³ are presented
below. In addition, we also present the mean number of rules in the produced
hypotheses.

² The set of positive examples consists of all sentences of up to 10 words that can
be generated by the grammar (2869 sentences) and the equal sized set of negative
examples was generated by applying the following procedure to each positive
example: i) replace a randomly selected word in the sentence with a randomly selected
word from the corpus, ii) go to step i with a probability of 0.5 and iii) restart the
procedure if the resulting sentence is a positive example.



4.2 Experimental Results 

In Table 1, we present the accuracy, cpu time and number of rules in the
hypothesis produced by each algorithm using both the probability metrics and the
information gain heuristics for all domains. The domains within the first group
in the table are propositional, the domains within the second group are numerical,
and the domains in the last group are relational. For all accuracies, bold face
indicates that there is a statistically significant difference between the method
and some less accurate method and no significant difference between the method
and some more accurate method (if any). Furthermore, underline indicates that
there is a statistically significant difference between the method and some more
accurate method and no significant difference between the method and some less
accurate method (if any).

The accuracies of the three methods are shown in columns 2-4. To summarise 
these results, one can see that DAC has a best/worst score (as indicated by bold
and underline in the table) of 3/10 (5/8) (the first score is for the probability 
metrics and the second is for the information gain heuristics) . The corresponding 
score for SAC is 9/3 (8/3) and for RAC 8/3 (8/2). Looking more closely at the 
domains, one can see that there is a significant difference in accuracy between 
DAC and SAC in favour of the latter in those (artificial) domains in which 
the replication problem was expected to occur (Tic-Tac-Toe, KRKI) but also 
in several of the other (natural) domains (most notably Student loan and Alzh. 
chol.). One can see that RAC effectively avoids the replication problem in these 
domains and is almost as accurate as, or even more accurate than, SAC. 

DCG is the only relational domain in which DAC is significantly more
accurate than SAC and RAC (although the difference is small). In this domain the
replication problem is known not to occur since the shortest correct grammar
is within the hypothesis space of DAC. The difference in accuracy can here be
explained by the different versions of the probability metrics and information
gain heuristics that are used for DAC and SAC/RAC. For DAC these reward
splits that discriminate positive from negative examples while for SAC/RAC
they reward rules with a high coverage of positive examples.

In columns 5-7, the cpu times of the three methods are shown. The median for
the cpu time ratio SAC/DAC is 5.68 (5.77) and for the cpu time ratio RAC/DAC

³ The amount of work was also measured by counting the number of times it was
checked whether or not a rule covers an example, which has the advantage over the
former measure that it is independent of the implementation but the disadvantage
that it does not include the (small) overhead of RAC due to handling the stack.
However, both measures gave consistent results and we have chosen to present only
the former. All algorithms were implemented in SICStus Prolog 3 and were
executed on a SUN Ultra 60, except for the numerical domains which were executed
on a SUN Ultra 1.
Domain        Accuracy (percent)     Time (seconds)               No. of rules
                 DAC    SAC    RAC        DAC      SAC      RAC      DAC    SAC    RAC
Shuttle        97.98  99.17  99.17       0.13     0.34     0.17      9.9    5.8    5.8
               98.33  99.17  99.17       0.13     0.33     0.16      8.1    5.8    5.8
Housevotes     93.26  94.11  93.95       1.78     4.98     2.29     17.1   13.8   15.1
               93.64  93.95  93.64       1.73     5.00     2.20     16.7   13.9   15.4
Tic-Tac-Toe    85.59  99.58  99.03       2.04     7.10     4.13    108.3   13.0   21.2
               85.63  99.55  99.13       1.74     6.74     3.96    107.2   12.4   20.3
KRvsKP         99.58  99.20  99.30      96.90   379.77   119.93     18.5   17.4   16.2
               99.59  99.18  99.23      95.77   386.40   145.31     18.2   18.1   20.4
Splice (n)     92.97  91.67  92.41     284.76  3508.04   904.60    163.9   82.7  118.8
               92.88  90.07  91.88     278.53  3636.01   897.20    162.7   84.9  123.4
Splice (ei)    95.29  96.32  96.23     208.49  2310.56   318.57     79.6   30.6   42.9
               95.43  96.04  96.09     204.64  2344.55   311.55     77.4   31.7   42.9
Splice (ie)    93.77  93.29  93.55     243.76  4109.09   375.68    111.8   47.4   67.1
               94.10  93.01  93.51     235.31  3980.19   419.59    106.7   46.9   67.3
Monks-2        66.44  70.28  73.89       2.03    31.73     5.31    129.2  107.3  108.7
               72.44  68.78  73.11       0.88    31.52     5.00    121.0  107.9  109.1
Monks-3        96.79  96.42  96.42       0.69     3.98     1.15     27.1   23.7   23.7
               97.03  96.30  96.48       0.52     3.87     1.12     26.4   23.6   23.8
Bupa           62.25  65.59  65.00      39.62   192.75    75.88     37.4   25.1   29.4
               63.92  65.69  65.88      43.00   193.31    74.13     36.3   26.9   31.9
Ionosphere     88.10  86.86  88.29    1201.02  1965.23  1273.14      9.1    5.8    7.2
               88.19  86.57  87.05    1371.10  1916.97  1035.10      8.3    7.7    9.6
Pima Indians   72.73  71.26  71.95     323.73  3079.90   587.48     57.5   38.4   45.1
               72.60  71.99  70.39     329.26  2756.38   537.69     56.4   40.2   49.9
Sonar          71.27  71.90  73.49    2049.23  3484.61  2319.63      9.7    6.5    8.5
               72.70  74.13  74.60    2123.46  3247.45  2254.65      9.3    6.8    9.4
WDBC           91.40  94.09  91.99    4872.89  6343.40  4885.81      9.1    6.6    8.4
               91.99  93.63  92.63    4870.71  6656.17  4758.54      8.9    6.7   10.2
KRKI           98.03  98.90  99.17      50.37   101.90    55.46     26.8    9.2   13.0
               98.07  98.90  99.13      50.08   101.75    55.39     26.7    9.4   12.7
DCG            99.99  99.93  99.92      43.83   205.76   129.40     28.0   28.8   28.5
               99.99  99.97  99.95      44.17   204.05   130.55     28.0   28.1   28.1
Student loan   90.53  96.57  96.57      10.72    60.94    21.00    113.4   35.9   44.0
               92.03  96.83  96.40      10.13    58.50    21.48     96.7   36.7   44.7
Alzh. toxic 


77.53 


81.50 


81.84 


238.27 


1581.34 


406.28 


198.4 


83.5 


92.7 




77.68 


81.54 


81.91 


195.89 


1593.38 


417.64 


190.9 


84.8 


95.4 


Alzh. amine 


86.14 


85.60 


85.56 


208.13 


2146.01 


337.96 


148.8 


81.6 


92.0 




85.89 


85.85 


84.15 


162.83 


1854.61 


330.25 


144.8 


82.6 


94.0 


Alzh. scop. 


56.15 


59.84 


60.21 


215.50 


2673.01 


501.71 


263.1 


124.5 


135.4 




55.47 


60.99 


60.36 


195.14 


2259.70 


528.71 


261.5 


121.4 


132.3 


Alzh. chol. 


64.76 


75.06 


74.99 


452.24 


10148.40 


1063.30 


450.4 


228.2 


240.8 




64.86 


75.54 


75.31 


364.41 


9980.11 


1103.79 


438.8 


226.0 


244.9 



Table 1. Accuracy, cpu time, and number of rules using probability metrics (first line) 
and information gain (second line). 

it is 1.71 (1.78). The domain in which the SAC/RAC ratio is highest is Splice (ie), 
where the ratio is 10.94 (9.5). Except for Monks-2, the domain for which the 
SAC/DAC ratio is highest is Alzh. chol., where it is 27.31 (22.44). The RAC/DAC 
ratio in this domain is 3.03 (2.35). In Monks-2, the SAC/DAC ratio is even higher 
for the probability metrics (35.82) but lower for the information gain heuristics 
(15.63). The RAC/DAC ratio reaches its highest value in this domain (5.68 for 
the probability metric). 
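
The ratios above are quotients of the Time columns of Table 1. As a minimal sketch (the dictionary layout and the helper name `ratio` are illustrative and not part of the original experiments), the two SAC/RAC values quoted for Splice (ie) can be recomputed from the table entries:

```python
# CPU times in seconds as (DAC, SAC, RAC), copied from the two Splice (ie)
# rows of Table 1 (first line and second line of the table entry).
SPLICE_IE = {
    "first line": (243.76, 4109.09, 375.68),
    "second line": (235.31, 3980.19, 419.59),
}

COLUMN = {"DAC": 0, "SAC": 1, "RAC": 2}


def ratio(row, numerator, denominator):
    """Return the cpu time ratio numerator/denominator for one table row."""
    return row[COLUMN[numerator]] / row[COLUMN[denominator]]


for label, row in SPLICE_IE.items():
    print(label, round(ratio(row, "SAC", "RAC"), 2))
```

Rounded to one decimal place, the second value agrees with the 9.5 reported in the text.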

In columns 8-10, the average number of rules produced is shown. DAC 
produces more rules than SAC in all domains except one (DCG). The median 
for the number of rules produced by SAC is 28.8 (28.1), for RAC 29.4 (31.9) 
and for DAC 57.5 (56.4). These results are consistent with the fact that SAC 
and RAC search a larger hypothesis space than DAC in which more compact 
hypotheses may be found. 

In summary, both RAC and SAC outperform DAC in most domains tested in 
the experiments, mainly due to the effective handling of the replication problem. 
But although RAC is about as accurate as SAC, it is up to an order of magnitude 
faster. 

5 Related Work 

Two previous systems, IREP and RIPPER, are able to efficiently process 
large sets of noisy data despite the use of Separate-and-Conquer. The main 
reason for this efficiency is the use of a technique called incremental reduced error 
pruning, which prunes each rule immediately after it has been induced, rather 
than after all rules have been generated. This speeds up the induction process as 
the pruned rules allow larger subsets of the remaining positive examples to be 
removed at each iteration compared to the non-pruned rules. It should be noted 
that this technique could also be employed directly in Reconsider-and-Conquer, 
improving the efficiency (and accuracy) further, especially in noisy domains. 

Like Reconsider-and-Conquer, a recently proposed method for rule induction, 
called PART, also employs a combination of Divide-and-Conquer and 
Separate-and-Conquer. One major difference between PART and 
Reconsider-and-Conquer is that the former uses Divide-and-Conquer to find one 
rule that is added to the resulting hypothesis, while the latter uses (a 
generalised version of) Divide-and-Conquer to generate a set of rules that is 
added. PART uses Divide-and-Conquer not to gain efficiency over 
Separate-and-Conquer, but to avoid a problem called hasty generalisation that 
may occur when employing incremental reduced error pruning, as IREP and 
RIPPER do. Again, the technique used by PART may in fact serve as a pruning 
technique in conjunction with Reconsider-and-Conquer rather than 
Separate-and-Conquer. 

In C4.5rules, a set of rules is first generated using Divide-and-Conquer, 
and then simplified by a post-pruning process. However, the cost of this process 
is cubic in the number of examples, which means that it could be even more 
expensive than using Separate-and-Conquer in the first place to overcome the 
replication problem. Still, the post-pruning techniques employed by C4.5rules (and 
other systems, e.g. RIPPER) could be useful for both Separate-and-Conquer and 
Reconsider-and-Conquer. The main advantage of using Reconsider-and-Conquer 
for generating the initial set of rules, compared to Divide-and-Conquer 
as used in C4.5rules, is that significantly fewer rules need to be considered by the 
post-pruning process. 

There have been other approaches to the replication problem within the 
framework of decision tree learning. One approach is to restrict the form of the 
trees when growing them, which then allows for merging of isomorphic subtrees. 
It should be noted that these techniques are, in contrast to 
Reconsider-and-Conquer, still restricted to propositional and numerical domains. 

6 Concluding Remarks 

A hybrid strategy of Divide-and-Conquer and Separate-and-Conquer has been 
presented, called Reconsider-and-Conquer. Experimental results from proposi- 
tional, numerical and relational domains have been presented demonstrating 
that Reconsider-and-Conquer significantly reduces the replication problem from 
which Divide-and-Conquer suffers and that it is several times (up to an order 
of magnitude) faster than Separate-and-Conquer. In the trade-off between accu- 
racy and amount of cpu time needed, we find Reconsider-and-Conquer in many 
cases to be a very good alternative to both Divide-and-Conquer and Separate- 
and-Conquer. 

There are a number of directions for future research. One is to explore both 
pre- and post-pruning techniques in conjunction with Reconsider-and-Conquer. 
The techniques that have been developed for Separate-and-Conquer can in fact 
be employed directly as mentioned in the previous section. Another direction is 
to investigate alternative conditions for the decision made by Reconsider-and- 
Conquer regarding whether or not the search should continue from some rule 
on a completed branch. The currently employed condition that the fraction of 
covered positive examples should never decrease worked surprisingly well, but 
other conditions, e.g. based on some significance test, may be even more effective. 

Abstract. Most ILP systems employ the covering algorithm whereby 
hypotheses are constructed iteratively clause by clause. Typically the covering 
algorithm is greedy in the sense that each iteration adds the best clause 
according to some local evaluation criterion. Some typical problems of the 
covering algorithm are: unnecessarily long hypotheses, difficulties in handling 
recursion, difficulties in learning multiple predicates. This paper investigates a 
non-covering approach to ILP, implemented as a Prolog program called 
HYPER, whose goals were: use intensional background knowledge, handle 
recursion well, and enable multi-predicate learning. Experimental results in this 
paper may appear surprising in the view of the very high combinatorial 
complexity of the search space associated with the non-covering approach. 



1 Introduction 

Most ILP systems employ the covering algorithm whereby hypotheses are induced 
iteratively clause by clause. Examples of such systems are Quinlan’s FOIL [5], 
Grobelnik’s Markus [2], Muggleton’s PROGOL [3] and Pompe’s CLIP [4]. The 
covering algorithm builds hypotheses gradually, starting with the empty hypothesis 
and adding new clauses one by one. Positive examples covered by each new clause 
are removed, until the remaining positive examples are reduced to the empty set, that 
is, the clauses in the hypothesis cover all the positive examples. 

Typically the covering algorithm is greedy in the sense that on each iteration it 
chooses to add the clause that optimises some evaluation criterion. Such a clause tends 
to be optimal locally, with respect to the current set of clauses in the hypothesis. 
However there is no guarantee that the covering process will result in a globally 
optimal hypothesis. A good hypothesis is not necessarily assembled from locally 
optimal clauses. On the other hand, locally inferior clauses may cooperate well as a 
whole, giving rise to an overall good hypothesis. 
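
The greedy covering loop described above can be sketched as follows. This is a generic illustration, not the code of any of the cited systems: the function name `covering`, the fixed pool of candidate clauses, and the `covers` and `score` callbacks are assumptions made for the example (real systems construct clauses by search rather than choosing from a fixed list).

```python
def covering(positives, candidates, covers, score):
    """Greedy covering: repeatedly add the locally best clause.

    Returns the list of chosen clauses and the positives left uncovered.
    """
    hypothesis = []
    remaining = set(positives)
    while remaining:
        # Choose the clause that scores best on the still-uncovered positives.
        best = max(candidates, key=lambda c: score(c, remaining))
        newly_covered = {e for e in remaining if covers(best, e)}
        if not newly_covered:
            break  # no candidate makes progress; stop instead of looping forever
        hypothesis.append(best)
        remaining -= newly_covered  # covered positives are removed
    return hypothesis, remaining


# Toy usage: a "clause" is simply the set of examples it covers.
candidates = [frozenset({1, 2}), frozenset({2, 3}),
              frozenset({3, 4}), frozenset({1, 2, 3})]
covers = lambda clause, example: example in clause
score = lambda clause, remaining: len(clause & remaining)
hypothesis, uncovered = covering({1, 2, 3, 4}, candidates, covers, score)
print(hypothesis, uncovered)
```

Each clause is chosen only with respect to the examples still uncovered at that moment, which is the local optimisation the text refers to.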

Some typical problems due to the greedy nature of the covering algorithm are: 

• Unnecessarily long hypotheses with too many clauses 

• Difficulties in handling recursion 

• Difficulties in learning multiple predicates simultaneously 

In view of these difficulties a non-covering approach, where a hypothesis (with all its 
clauses) would be constructed as a whole, would seem to be a better idea. Of course a 
strong practical reason against this, and in favour of the covering approach, is the 
combinatorial complexity involved. The local optimisation of individual clauses is 
complex enough, so the global optimisation of whole hypotheses would seem to be 
out of the question. Experimental results in this paper are possibly surprising in this 
respect. 

In this paper we investigate a non-covering approach and study its performance 
on typical small ILP learning problems that require recursion. We develop such a non- 
covering algorithm, implemented as a Prolog program called HYPER (Hypothesis 
Refiner, as opposed to the majority of ILP programs that are "clause refiners"). The 
design goals of the HYPER program were: 

• simple, transparent and short 

• handle intensional background knowledge 

• handle recursion well 

• enable multipredicate learning 

• handle reasonably well typical basic ILP benchmark problems (member/2, 
append/3, path/3, sort/2, arches/3, etc.) without having to resort to special tricks, 
e.g. unnatural declaration of argument modes 

HYPER does not address the problem of noisy data. So it aims at inducing short 
hypotheses that are consistent with the examples, that is: cover all the positive 
examples and no negative one. 

2 Mechanisms of HYPER 

According to the design goal of simplicity, the program was developed in the 
following fashion. First, an initial version was written which is just a simple generator 
of possible hypotheses for the given learning problem (i.e. background predicates and 
examples). The search strategy in this initial version was simple iterative deepening. 
Due to its complexity, this straightforward approach, as expected, fails even for 
the simplest learning problems like member/2 or append/3. Then additional mechanisms 
were added to the program, such as better search and mode declarations, to improve 
its efficiency. Thirteen versions were written in this way with increasingly better 
performance. Once a reasonable performance on the basic benchmark problems was 
obtained, the added mechanisms were selectively switched off to see which of them 
were essential. Eventually, the final version was obtained with the smallest set of 
added mechanisms which still performs reasonably well. The mechanisms that were 
experimentally found to be essential are described in detail below. Later we also 
discuss mechanisms that somewhat surprisingly proved not to be particularly useful. 

2.1 Search 

HYPER constructs hypotheses in the top-down fashion by searching a refinement tree 
in which the nodes correspond to hypotheses. A hypothesis H0 in this tree has 
successors Hi, where hypotheses Hi are the least specific (in some sense) refinements 
of H0. Refinements are defined by HYPER’s refinement operator described in the 
next section. Each newly generated hypothesis is more specific than or equal to its 
predecessor in the sense of theta-subsumption. So a hypothesis can only cover a subset 
of the examples covered by the hypothesis’ predecessor. The learning examples are 
always assumed noise-free, and the goal of search is to find a hypothesis that is 
consistent with the examples. That is, it covers all the positive examples and no 
negative example. If a hypothesis is generated that does not cover all the positive 
examples, it is immediately discarded because it can never be refined into a consistent 
hypothesis. Excluding such hypotheses from the search tree reduces the tree 
considerably. During search, new hypotheses are not checked whether they are 
duplicates or in any sense equivalent to already generated hypotheses. 

Search starts with a set of initial hypotheses. This set is the set of all possible bags 
of user-defined start clauses of up to the user-defined maximal number of clauses in a 
hypothesis. Multiple copies of a start clause typically appear in a start hypothesis. A 
typical start clause is something rather general and neutral, such as: append( L1, L2, 
L3). 
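
As a minimal sketch of how such a set of start hypotheses could be enumerated, the following treats a bag as a multiset of start clauses. The function name `start_hypotheses` and the string representation of clauses are illustrative, and the assumption that a start hypothesis contains at least one clause is ours.

```python
from itertools import combinations_with_replacement


def start_hypotheses(start_clauses, max_clauses):
    """Enumerate all bags (multisets) of start clauses with 1..max_clauses members."""
    hypotheses = []
    for size in range(1, max_clauses + 1):
        # combinations_with_replacement yields each multiset of the given size once.
        hypotheses.extend(combinations_with_replacement(start_clauses, size))
    return hypotheses


bags = start_hypotheses(["odd(L)", "even(L)"], 4)
print(len(bags))  # number of start hypotheses for two start clauses, up to 4 clauses
```

With two start clauses and at most four clauses per hypothesis this gives 2 + 3 + 4 + 5 = 14 bags, several of which contain multiple copies of the same start clause.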

HYPER performs a best-first search using an evaluation function that takes into 
account the size of a hypothesis and its accuracy in a simple way by defining the cost 
of a hypothesis H as: 

Cost(H) = W1 * Size(H) + W2 * NegCover(H) 

where NegCover(H) is the number of negative examples covered by H. The definition 
of ‘H covers example E’ in HYPER roughly corresponds to ‘E can be logically 
derived from H’. There are however some essential procedural details described later. 
W1 and W2 are weights. The size of a hypothesis is defined simply as a weighted sum 
of the number of literals and number of variables in the hypothesis: 

Size(H) = k1 * #literals(H) + k2 * #variables(H) 

All the experiments with HYPER described later were done with the following 
settings of the weights: W1=1, W2=10, k1=10, k2=1, which corresponds to: 

Cost(H) = #variables(H) + 10 * #literals(H) + 10 * NegCover(H) 

These settings are ad hoc, but their relative magnitudes are intuitively justified as 
follows. Variables in a hypothesis increase its complexity, so they should be taken 
into account. However, the literals increase the complexity more, hence they 
contribute to the cost with a greater weight. A covered negative example contributes 
to a hypothesis’ cost as much as a literal. This corresponds to the intuition that an 
extra literal should at least prevent one negative example from being covered. It 
should be noted that these weights can be varied considerably without affecting the 
search performance. For example, changing k1 from 1 to 5 had no effect on search in 
the experiments. 
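
The cost function and the best-first search that discards incomplete hypotheses can be sketched together as follows. This is an illustration under stated assumptions, not HYPER's Prolog code: the function names, the callback interface (`refine`, `is_complete`, `neg_cover`, `measure`) and the toy problem over integers are all invented for the example.

```python
import heapq
from itertools import count


def size(num_literals, num_variables, k1=10, k2=1):
    """Size(H) = k1 * #literals(H) + k2 * #variables(H)."""
    return k1 * num_literals + k2 * num_variables


def cost(num_literals, num_variables, neg_cover, w1=1, w2=10):
    """Cost(H) = w1 * Size(H) + w2 * NegCover(H)."""
    return w1 * size(num_literals, num_variables) + w2 * neg_cover


def best_first(start, refine, is_complete, neg_cover, measure):
    """Expand the cheapest hypothesis first; return the first consistent one.

    measure(h) must return the pair (#literals, #variables) of hypothesis h.
    """
    tie = count()  # tie-breaker so the heap never compares hypotheses directly
    frontier = []

    def push(h):
        if is_complete(h):  # incomplete hypotheses are discarded immediately
            lits, variables = measure(h)
            entry = (cost(lits, variables, neg_cover(h)), next(tie), h)
            heapq.heappush(frontier, entry)

    for h in start:
        push(h)
    while frontier:
        _, _, h = heapq.heappop(frontier)
        if neg_cover(h) == 0:
            return h  # complete and covers no negative example
        for r in refine(h):
            push(r)
    return None


# Toy problem (hypothetical): a hypothesis is a set of named tests on integers
# and covers an integer if all of its tests hold.
TESTS = {
    "even": lambda x: x % 2 == 0,
    "gt3": lambda x: x > 3,
    "mult4": lambda x: x % 4 == 0,
}
POS, NEG = [4, 8], [2, 3, 6]


def covers(h, x):
    return all(TESTS[t](x) for t in h)


result = best_first(
    start=[frozenset()],
    refine=lambda h: [h | {t} for t in TESTS if t not in h],
    is_complete=lambda h: all(covers(h, x) for x in POS),
    neg_cover=lambda h: sum(covers(h, x) for x in NEG),
    measure=lambda h: (len(h), 0),
)
print(sorted(result))
```

With the default weights, a hypothesis with 5 literals and 5 variables that covers 2 negative examples would have cost 5 + 50 + 20 = 75.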



2.2 Hypothesis Refinement 

To refine a clause, perform one of the following: 

1. Unify two variables in the clause, e.g. X1 = X2. 

2. Refine a variable in the clause into a background term, e.g. replace variable L0 
with term [X|L]. 

3. Add a background literal to the clause. 

Some pragmatic details of these operations are as follows: 

(a) The arguments of literals are typed. Only variables of the same type can be unified. 
The user defines background knowledge, including "back-literals" and "back-terms". 
Only these can be used in refining a variable (of the appropriate type) into a term, 
and in adding a literal to a clause. 

(b) Arguments in back-literals can be defined as input or output. When a new literal is 
added to a clause, all of its input arguments have to be unified (non-deterministically) 
with the existing non-output variables in the clause (that is those variables that are 
assumed to have their values instantiated at this point of executing the clause). 

To refine a hypothesis H0, choose one of the clauses C0 in H0, refine clause C0 into C, 
and obtain a new hypothesis H by replacing C0 in H0 with C. This says that the 
refinements of a hypothesis are obtained by refining any of its clauses. There is a 
useful heuristic that often saves complexity. Namely, if a clause is found in H0 that 
alone covers a negative example, then only refinements arising from this clause are 
generated. The reason is that such a clause necessarily has to be refined before a 
consistent hypothesis is obtained. This will be referred to as the "covers-alone heuristic". 
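
The refinement of a hypothesis, including the covers-alone heuristic, can be sketched as follows. The names `refine_hypothesis`, `refine_clause` and `covers_negative_alone` are hypothetical, clauses are represented as opaque values, and taking the first clause that covers a negative example alone is our reading of the heuristic.

```python
def refine_hypothesis(hypothesis, refine_clause, covers_negative_alone):
    """Return the hypotheses obtained by refining exactly one clause.

    hypothesis is a tuple of clauses; refine_clause(c) returns the refinements
    of clause c; covers_negative_alone(c) tells whether clause c on its own
    covers some negative example.
    """
    # Covers-alone heuristic: if some clause alone covers a negative example,
    # generate only the refinements arising from that clause.
    alone = next(
        (i for i, c in enumerate(hypothesis) if covers_negative_alone(c)), None
    )
    indices = range(len(hypothesis)) if alone is None else [alone]

    refinements = []
    for i in indices:
        for c in refine_clause(hypothesis[i]):
            # Replace clause i by its refinement c; the number of clauses is unchanged.
            refinements.append(hypothesis[:i] + (c,) + hypothesis[i + 1:])
    return refinements


# Toy usage: clauses are strings and a refinement appends a letter.
toy_refine = lambda c: [c + "a", c + "b"]
with_heuristic = refine_hypothesis(("p", "q"), toy_refine, lambda c: c == "p")
without_heuristic = refine_hypothesis(("p", "q"), toy_refine, lambda c: False)
print(with_heuristic)
print(without_heuristic)
```

In the toy run the heuristic halves the number of generated refinements, since only the clause that covers a negative example on its own is refined.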

This refinement operator aims at producing least specific specialisations (LSS). 
However, it really only approximates LSS. This refinement operator does LSS under 
the constraint that the number of clauses in a hypothesis after refinement stays the 
same. Without this restriction, an LSS operator should be more appropriately defined 
as: 



refs_hyp(H0) = { H0 - {C0} ∪ refs_clause(C0) | C0 ∈ H0 } 

where refs_hyp(H0) is the set of all LSS of hypothesis H0, and refs_clause(C0) is the 
set of all LSS of clause C0. This unrestricted definition of LSS was not implemented 
in HYPER for the obvious reason of complexity. 

2.3 Interpreter for Hypotheses 

To prove an example, HYPER uses intensional background knowledge together with 
the current hypothesis. To this end HYPER includes a Prolog meta-interpreter which 
does approximately the same as the Prolog interpreter, but takes care of the possibility 
of falling into infinite loops. Therefore the length of proofs is limited to a specified 
maximal number of proof steps (resolution steps). This was set to 6 in all the 
experiments mentioned in this paper. It is important to appropriately handle the cases 
where this bound is exceeded. It would be a mistake to interpret such cases simply as 
’fail’. Instead, the following interpretation was designed and proved to be essential for 
the effectiveness of HYPER. The interpreter is implemented as the predicate: 

prove( Goal, Hypo, Answer) 

Goal is the goal to be proved using the current hypothesis Hypo and background 
knowledge. The predicate prove/3 always succeeds and Answer can be one of the 
following three cases: 

Answer = yes if Goal is derivable from Hypo in no more than D steps (max. proof 
length) 

Answer = no if Goal is not derivable even with unlimited proof length 
Answer = maybe if proof search was terminated after D steps 

The interpretation of these answers, relative to the standard Prolog interpreter, is as 
follows: ’yes’ means that Goal under the standard interpreter would definitely succeed, 
’no’ means that Goal under the standard interpreter would definitely fail, ’maybe’ 
means any one of the following three possibilities: 

1. The standard Prolog interpreter (no limit on proof length) would get into an infinite 
loop. 

2. The standard Prolog interpreter would eventually find a proof of length greater 
than D. 

3. The standard Prolog interpreter would find, at some length greater than D, that this 
derivation alternative fails. Therefore it would backtrack to another alternative and 
there possibly find a proof (of length possibly no greater than D), or fail, or get 
into an infinite loop. 

The question now is how to react to answer ’maybe’ when processing the learning 
examples. HYPER reacts as follows: 

• When testing whether a positive example is covered, ’maybe’ is interpreted as ’not 
covered’. 

• When testing whether a negative example is not covered, ’maybe’ is interpreted as 
not ’not covered’, i.e. as ’covered’. 

A clarification is in order regarding what counts as a step in a proof. Only resolution 
steps involving clauses in the current hypothesis count. If backtracking occurs before 
proof length is exceeded, the backtracked steps are discounted. When proving the 
conjunction of two goals, the sum of their proof lengths must not exceed max. proof 
length. Calls of background predicates, defined in Prolog, are delegated to the 
standard Prolog interpreter and do not incur any increase in proof length. It is up to the 
user to ensure that the standard Prolog interpreter does not get into an infinite loop 
when processing such "background" goals. 
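
The treatment of the three answers when testing examples, described earlier in this section, can be sketched as follows. The function names are illustrative; the sketch only models how the answers 'yes', 'no' and 'maybe' are interpreted, not the bounded proof search itself.

```python
def positive_covered(answer):
    """A positive example counts as covered only on a definite 'yes'."""
    return answer == "yes"


def negative_covered(answer):
    """A negative example counts as covered on 'yes' and also on 'maybe'."""
    return answer in ("yes", "maybe")


def evaluate(pos_answers, neg_answers):
    """Return (complete, neg_cover) for the answers obtained on the examples."""
    complete = all(positive_covered(a) for a in pos_answers)
    neg_cover = sum(negative_covered(a) for a in neg_answers)
    return complete, neg_cover


print(evaluate(["yes", "maybe"], ["no", "maybe", "yes"]))
```

Both interpretations are pessimistic: a 'maybe' can only make a hypothesis look worse, never better.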



2.4 Example 

HYPER is implemented in Prolog. In this implementation, a hypothesis is represented 
as a list of clauses. A clause is a list of literals accompanied by a list of variables and 
their types. For example: 

[member( X0, [X1 | L1]), member( X0, L2)] / 

[X0:item, X1:item, L1:list, L2:list] 

corresponds to the Prolog clause: 

member( X0, [X1 | L1]) :- member( X0, L2). 

where the variables X0 and X1 are of type item, and L1 and L2 are of type list. The 
types are user-defined. 

Figure 1 shows the specification accepted by HYPER of the problem of learning two 
predicates simultaneously: even(L) and odd(L), where even(L) is true if L is a list with 
an even number of elements, and odd(L) is true if L is a list with an odd number of 
elements. In this specification, 

backliteral( even( L), [ L:list], [ ]). 

means: even(L) can be used as a background literal when refining a clause. The 
argument L is of type list. L is an input argument; there are no output arguments. 
prolog_predicate( fail) means that there are no background predicates defined in 
Prolog. The predicate term/3 specifies how variables of given types (in this case list) 
can be refined into terms, comprising variables of specified types. Start hypotheses are 
all possible bags of up to max_clauses = 4 start clauses of the two forms given in Fig. 
1. For this learning problem HYPER finds the following hypothesis consistent with 
the data: 

even( [ ]). 

even( [ A, B | C]) :- even( C). 

odd( [ A | B]) :- even( B). 

Before this hypothesis is found, HYPER has generated altogether 66 hypotheses, 
refined 16 of them, kept 29 as further candidates for refinement, and discarded the 
remaining 21 as incomplete (i.e. not covering all the positive examples). To reach the 
final hypothesis above from the corresponding start hypothesis, 6 refinement steps are 
required. The size of the complete refinement forest of up to 6 refinement steps from 
the start hypotheses in this case, respecting the type constraints as specified in Fig. 1, 
is 22565. The actual number of hypotheses generated during search was thus less than 
0.3% of the total refinement forest to depth 6. 
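
The stated fraction follows directly from the two counts given above; as a quick check (variable names are illustrative):

```python
generated = 66        # hypotheses generated during search (from the text)
forest_size = 22565   # size of the refinement forest to depth 6 (from the text)

fraction = generated / forest_size
print(f"{100 * fraction:.2f}%")
```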

This learning problem can also be defined more restrictively by allowing 
term refinements on lists to depth 1 only, thus suppressing terms like [X1, X2 | L]. 
This can be done by using type "list(Depth)" in the definition of term/3 and 
start_clause/1 as follows: 

term( list( D), [ X | L], [ X:item, L:list( 1)]) :- 
    var( D).   % list(1) cannot be refined further! 

term( list( D), [ ], [ ]). 

start_clause( [ odd( L)] / [ L:list( D)]). 

start_clause( [ even( L)] / [ L:list( D)]). 

Using this problem definition, HYPER finds the mutually recursive definition of 
even/1 and odd/1: 

even( [ ]). 

odd( [ A | B]) :- even( B). 

even( [ A | B]) :- odd( B). 
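
As an illustrative check, the mutually recursive definition can be transcribed into Python and tested against the examples of Fig. 1. The transcription is ours; HYPER itself evaluates hypotheses with its Prolog meta-interpreter.

```python
def even(lst):
    """even([]).  even([A|B]) :- odd(B)."""
    return len(lst) == 0 or odd(lst[1:])


def odd(lst):
    """odd([A|B]) :- even(B)."""
    return len(lst) > 0 and even(lst[1:])


# Examples of Fig. 1 as (predicate, argument) pairs.
positives = [(even, []), (even, list("ab")), (odd, list("a")),
             (odd, list("bcd")), (odd, list("abcde")), (even, list("abcd"))]
negatives = [(even, list("a")), (even, list("abc")), (odd, []),
             (odd, list("ab")), (odd, list("abcd"))]

complete = all(pred(arg) for pred, arg in positives)
consistent = complete and not any(pred(arg) for pred, arg in negatives)
print(complete, consistent)
```

The definition covers all the positive examples and none of the negative ones, i.e. it is consistent with the data in the sense used throughout the paper.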



2.5 Mechanisms that Did Not Help 

Several mechanisms were parts of intermediate versions of HYPER, but were 
eventually left out because they were found not to be clearly useful. They did not 
significantly improve search complexity and at the same time incurred some 
complexity overheads of their own. Some of these mechanisms are mentioned below 
as possibly useful "negative lessons": 

• Immediately discard clause refinements that render the corresponding hypothesis 
unsatisfiable (i.e. cannot succeed even on the most general query). 

• Checking for redundant or duplicate hypotheses, where "duplicate" may mean 
either literally the same under the renaming of the variables, or some kind of 
equivalence between sets of clauses and literals in the clauses, or redundancy 
based on theta-subsumption between hypotheses. One such idea is to discard those 
newly generated and complete hypotheses that subsume any other existing 
candidate hypothesis. This requires special care because the subsuming hypothesis 
may later be refined into a hypothesis that cannot be reached from other 
hypotheses. Perhaps surprisingly, no version of such a redundancy or duplicate test 
was found that was clearly useful. Also it is hard to find a useful 
subsumption-based test that would correspond well to the procedurally oriented 
interpretation of hypotheses. 



% Inducing odd and even length property for lists 

% Background literals 

backliteral( even( L), [ L:list], [ ]). 
backliteral( odd( L), [ L:list], [ ]). 

% Term refinements 

term( list, [ X | L], [ X:item, L:list]). 
term( list, [ ], [ ]). 

% Background predicates defined in Prolog 

prolog_predicate( none).   % no background predicate in Prolog 

% Start clauses 

start_clause( [ odd( L)] / [ L:list]). 
start_clause( [ even( L)] / [ L:list]). 

% Positive examples 

ex( even( [ ])). 
ex( even( [a,b])). 
ex( odd( [a])). 
ex( odd( [b,c,d])). 
ex( odd( [a,b,c,d,e])). 
ex( even( [a,b,c,d])). 

% Negative examples 

nex( even( [a])). 
nex( even( [a,b,c])). 
nex( odd( [ ])). 
nex( odd( [a,b])). 
nex( odd( [a,b,c,d])). 



Fig. 1: Definition of the problem of learning even/1 and odd/1 simultaneously. 

• "Closing" induced clauses as in Markus [2] to avoid the problem with uninstantiated 
outputs of a target predicate. Closing a hypothesis means to "connect" all the 
output arguments to instantiated variables (i.e. unifying all the output arguments 
with other arguments). There are typically many ways of closing a hypothesis. So 
in evaluating a hypothesis, a "best" closing would be sought (one that retains the 
completeness and covers the fewest negative examples). This is, however, not only 
combinatorially inefficient, but requires considerable elaboration because often the 
closing is not possible without making the hypothesis incomplete (closing is 
sometimes too coarse a specialisation step). 



3 Experiments 

Here we describe experiments with HYPER on a set of typical ILP test problems. All 
of them, except arches/3, concern relations on lists and require recursive definitions. 
The problems are: 

member( X, List) 
append( List1, List2, List3) 
even( List) + odd( List) (learning simultaneously even and odd length list) 
path( StartNode, GoalNode, Path) 
insort( List, SortedList) (insertion sort) 
arch( Block1, Block2, Block3) (Winston’s arches with objects taxonomy) 
invariant( A, B, Q, R) (Bratko and Grobelnik's program loop invariant [1]) 

For all of these problems, correct definitions were induced from no more than 6 
positive and 9 negative examples in execution times typically in the order of a second 
or a few seconds with SICStus Prolog on a 160 MHz PC. No special attention was paid 
to constructing particularly friendly example sets. Smaller example sets would 
possibly suffice for inducing correct definitions. Particular system settings or mode 
declarations to help in particular problems were avoided throughout. Details of these 
experiments are given in Table 1. The definitions induced were as expected, with the 
exception of a small surprise for path/3. The expected definition was: 

path( A, A, [A]). 

path( A, B, [A | C]) :- 
    link( A, D), path( D, B, C). 

The induced definition was slightly different: 

path( A, A, [A]). 

path( C, D, [C, B | A]) :- 
    link( C, B), path( B, D, [E | A]). 




Refining Complete Hypotheses in ILP 53 



This might appear incorrect because the last literal would normally be programmed as 
path( B, D, [B | A]), stating that the path starts with node B and not with the undefined 
E. However, the heads of both induced clauses take care of eventually instantiating E 
to B. 

The main points of interest in Table 1 are the search statistics. The last five 
columns give: the refinement depth RefDepth (the number of refinement steps needed 
to construct the final hypothesis from a start hypothesis), the number of all generated 
hypotheses, the number of refined hypotheses and the number of candidate hypotheses 
waiting to be refined, and the total size of the refinement forest up to depth RefDepth. 
This size corresponds to the number of hypotheses that would have to be generated if 
the search was conducted in the breadth-first fashion. The number of all generated 
hypotheses is greater than the sum of refined and to-be-refined hypotheses. The reason 
is that those generated hypotheses that are not complete (do not cover all the positive 
examples) are immediately discarded. Note that all these counts include duplicate 
hypotheses because when searching the refinement forest the newly generated 
hypotheses are not checked for duplicates. The total size of the refinement forest is 
determined by taking into account the covers-alone heuristic. The sizes would have 
been considerably higher without this heuristic, as illustrated in Table 2. This table 
tabulates the size of the search forest and the size of the refinement forest by 
refinement depth, and compares these sizes with or without covers-alone heuristic and 
duplicate checking. The total sizes of the refinement forests in Table 1 were 
determined by generating these trees, except for append/3 and path/3. These trees 
were too large to be generated, so their sizes in Table 1 are estimates, obtained by 
extrapolating the exponential growth to the required depth. These estimates are 
considerable underestimates. 

Tables 1 and 2 indicate the following observations: 

• In most of these cases HYPER only generates a small fraction of the total 
refinement forest up to solution depth. 

• The losses due to no duplicate checking are not dramatic, at least for the tabulated 
case of member/2. Also, as Table 2 shows, these losses are largely alleviated by 
the covers-alone heuristic. 

Search in HYPER is guided by two factors: first, by the constraint that only complete 
hypotheses are retained in the search; second, by the evaluation function. In the cases 
of Table 1, the evaluation function guides the search rather well except for invariant/3. 

One learning problem, not included in Table 1, where it seems HYPER did not 
perform satisfactorily, is the learning of quick-sort. In this problem, the evaluation 
function does not guide the search well because it does not discriminate the 
hypotheses on the path to the target hypothesis from other hypotheses. The target 
hypothesis emerges suddenly "from nothing". HYPER did induce a correct hypothesis 
for quick-sort with difference lists, but the definition of the learning problem (input- 
output modes) had to be defined in a way that was judged to be too unnatural 
assuming the user does not guess the target hypothesis sufficiently closely. Natural 
handling of such learning problems belongs to future work. 

Table 1. Complexity of some learning problems and corresponding search statistics. "Total 
size" is the number of nodes up to Refine depth in the refinement forest defined by HYPER's 
refinement operator, taking into account the "covers-alone" heuristic. 

Learning     Backgr.  Pos.    Neg.    Refine  Hypos.   To be    All        Total
problem      pred.    exam.   exam.   depth   refined  refined  generated  size
---------------------------------------------------------------------------------
member       0        3       3       5       20       16       85         1575
append       0        5       5       7       14       32       199        > 10^[?]
even + odd   0        6       5       6       23       32       107        22506
path         1        6       9       12      32       112      658        > 10^[?]
insort       2        5       4       6       142      301      1499       540021
arches       4        2       5       4       52       942      2208       3426108
invariant    2        6       5       3       123      2186     3612       18426



Table 2. Complexity of the search problem for inducing member/2. D is refinement depth, N is 
the number of nodes in refinement forest up to depth D with covers-alone heuristic, N(uniq) is 
the number of unique such hypotheses (i.e. after eliminating duplicates); N(all) is the number of 
all the nodes in refinement forest (no covers-alone heuristic), N(all,uniq) is the number of 
unique such hypotheses. 

D    N      N(uniq)   N(all)   N(all,uniq)
------------------------------------------
1    3      3         6        6
2    13     13        40       31
3    50     50        248      131
4    274    207       1696     527
5    1575   895       12880    2151



4 Conclusions 

The paper investigates the refinement of complete hypotheses. This was done 
experimentally by designing the ILP program HYPER, which refines 
complete hypotheses, not just clauses. It does not employ a covering algorithm, but 
constructs a complete hypothesis "simultaneously". This alleviates problems with 
recursive definitions, especially with mutual recursion (when both mutually recursive 
clauses are needed for each of them to be found useful). The obvious worry with this 
approach is its increased combinatorial complexity in comparison with covering. The 
experimental results, shown in Table 1, are possibly surprising in this respect. In most 
of the experiments on typical simple learning problems that involve recursion, 
HYPER's search heuristics cope well with the complexity of the hypothesis space. 
The heuristics seem to be particularly effective in learning predicates with structured 
arguments (lists). On the other hand, the heuristics were not effective for invariant/4, 
whose arguments are not structured and whose background predicates are arithmetic. 

Some other useful properties of the HYPER approach are: 

1. The program can start with an arbitrary initial hypothesis that can be refined to the 
target hypothesis. This is helpful when the user's intuition allows the 
search to start with an initial hypothesis closer to the target hypothesis. 

2. The target predicate is not necessarily the one for which the examples are given. 
E.g., there may be examples for predicate p/1, while an initial hypothesis contains the 
clauses: q(X) and p(X) :- r(X,Y), q(Y). 

3. The program always does a general-to-specific search. This monotonicity property 
makes it possible to determine bounds on some properties of the reachable hypotheses. E.g., if H 
covers P positive examples and N negative examples, then all the hypotheses 
reachable through refinements will cover at most P and N examples respectively. 

Abstract. In this paper, we present a new method based on nonmonotonic 
learning where the Inductive Logic Programming (ILP) algorithm is used twice 
and apply our method to acquire graphic design knowledge. Acquiring design 
knowledge is a challenging task because such knowledge is complex and vast. 
We thus focus on principles of layout and constraints that layouts must satisfy 
to realize automatic layout generation. Although we do not have negative 
examples in this case, we can generate them randomly by considering that a 
page with just one element moved is always wrong. Our nonmonotonic 
learning method introduces a new predicate for exceptions. In our method, the 
ILP algorithm is executed twice, exchanging positive and negative examples. 
From our experiments using magazine advertisements, we obtained rules 
characterizing good layouts and containing relationships between elements. 
Moreover, the experiments show that our method can learn more accurate rules 
than normal ILP can. 



1 Introduction 

As the World Wide Web (or simply "Web") has spread, more amateurs (not 
professional graphic designers) have begun to design web documents. This indicates 
the need to support them and is our motivation for studying how to acquire graphic 
design knowledge or, more generally, visual design knowledge. Our ultimate target is 
automated graphic design: automatically generating a high-quality design that conforms 
to a given intention, from the materials and the intention of the document. To realize this, we 
believe that knowledge-based approaches should be taken. There are two approaches 
for acquiring design knowledge. The first one is acquisition from experts, i.e., 
graphic designers. However, it is not easy to extract their expertise and to make it 
machine-processable. The other is to acquire design knowledge from examples. We 
think this is suitable for this domain. Lieberman pointed out that graphic designers 
communicate expert knowledge through specific example designs [Lieberman 95]. 

S. Dzeroski and P. Flach (Eds.): ILP-99, LNAI 1634, pp. 56-67, 1999. 

© Springer-Verlag Berlin Heidelberg 1999 




Acquiring Graphic Design Knowledge with Nonmonotonic Inductive Learning 57 



There has been some previous work on acquiring design knowledge from 
examples. The first kind of such work uses a statistical approach. For example, 
Ishiba et al. [Ishiba 97] analyzed the relationships between human feelings and design 
attributes of simple business documents, using linear regression. Typical human 
feelings are "warm," "simple," "powerful," and "unique." The design attributes 
include "the font used in the title" and "space between lines." In work of this kind, 
impressions are represented by a linear function of a fixed number of parameters. 

However, this approach can be applied only to designs of similar forms. In simple 
business documents, a page can be described by parameters, as Ishiba did. In 
posters or advertisements, however, the elements on a page differ from sample to sample. For 
example, parameters for the title cannot be fixed because some samples have no 
specific title and others have two titles. Thus, we need to model a page at the element 
level, not at the page level. In this case, a set of parameters is needed for each 
element, not only for the whole page, so a page requires a set of parameter sets. 
Since the number of parameters differs between samples, regression cannot be applied to 
this case. 

Another kind of work on acquiring design knowledge is based on command 
sequences. Mondrian [Lieberman 93, 96] is an example of this kind of system. 
Mondrian is a graphical editor that can learn new graphical procedures by example 
(through programming by demonstration). Mondrian records and generalizes 
procedures presented as sequences of graphical editing commands. However, the 
acquired knowledge is in procedural form, not in declarative form. The procedures 
learned are not sufficiently general; they can be used only on "analogous" examples. 
Moreover, existing documents cannot be used as examples for learning. 

In contrast to the two approaches above, we adopt Inductive Logic Programming 
(ILP) [Muggleton 91]. ILP can handle element-level page models. For example, the 
rule 

property1(Page) :- has(Page, Element), 
    attribute1(Element). 

states that a page is property1 if the page has (at least) one element which is 
attribute1. In this example, attribute1 is a parameter for each element, and 
a page can be described as a set of an arbitrary number of parameters such as 
attribute1(element1), attribute1(element2), ... 

Moreover, since ILP uses logic programs as knowledge representation, complex 
and declarative design knowledge can be acquired. Since ILP can also learn relations, 
it can acquire design knowledge represented as relationships between elements on 
pages. In addition, we introduce our new methods where the ILP algorithm is used 
twice. Details are described later. 

The paper is organized as follows. Section 2 clarifies the precise scope of this 
study. The dataset used is described in Section 3; the positive and negative examples, 
in Section 4; and the background knowledge, in Section 5. Section 6 presents our 
learning methods and the results of the experiments using three different algorithms. 
Finally, Section 7 concludes with a discussion and an outline of directions for further 
work. 



2 Target 

Acquiring graphic design knowledge is a challenging task because the knowledge 
as a whole is complex and vast, so we limit the scope of our study. We think the 
knowledge can be divided roughly into two kinds: knowledge about layout and 
knowledge about other things (color, pattern, texture, etc.). Here we exclude 
typesetting knowledge since it has already been investigated extensively. The latter kind can be 
represented mainly using attributes and is considered to be covered mostly by 
statistical approaches. Therefore, we deal with the layout part. 

The layout knowledge can be divided into two kinds. 

(1) Principles of layout - Basic rules about goodness of layout, i.e. rules by which 
to judge that a particular layout is good or bad (or medium). For example, layouts 
generated by designers are good ones, and layouts consisting of randomly placed 
elements are bad ones. 

(2) Relationships between layout and feelings - Knowledge about design 
alternatives conforming to the layout principles. Different layouts convey different 
human feelings, such as dynamic, stable, powerful, rhythmical, motive, orderly, or 
adventurous. 

Knowledge type (1) can realize automated layout generation, whereas knowledge 
type (2) can realize evaluation only. Therefore, we focus on the principles of layout. 



3 Page Model and Its Representation 

To show that ILP is effective for acquiring design knowledge, we deal with 
advertisements which the statistical approaches cannot handle. More precisely, we 
deal with one-page advertisements in technology magazines, using such 
advertisements found in BYTE [BYTE 97] as our data. Since electronic data are not 
available, we have to make our dataset by hand. 

First, we model pages for our purpose. In our model, each element on a page, such 
as a block of text, a picture, or an illustration, is called an object. Note that an object 
is not uniquely identified while making a dataset. A page is considered a set of 
objects. For the sake of simplicity, we model an object as a circumscribing rectangle. 
Ruled lines and frames are ignored, as are background images. Figure 1 illustrates an 
example of our page model. The figure is derived from an actual advertisement page 
in BYTE. The large object (a) is actually two columns of text which flows around the 
overlapping illustration (b). 

The following features are extracted from objects (the last three ones are only for 
text): 

• type: text, image, or illustration. Illustrations include charts, graphs, or logos. 
Decorated or deformed text (usually a title) is considered an image. 

Fig. 1. Example of our model 

• location: represented as p(L,U,R,D), where L, U are the X, Y coordinates of the 
upper left of the rectangle and R, D are the X, Y coordinates of the lower right of 
the rectangle. Coordinates are measured in millimeters. 

• number of columns. 

• alignment between text lines: left, right, center, or justify (just). Alignment for a 
single line is just. 

• font size: The size of font used most frequently, in points (= 1/4 mm). 

These features are represented as Prolog facts like 
smpl(PageID, ObjectID, Object). We give an example of the real 
representation of a page below: 

smpl(4, 1, obj(text(1,just,36), p(40,22,164,31))). 
smpl(4, 2, obj(illustration, p(74,34,132,119))). 
smpl(4, 3, obj(illustration, p(141,33,174,63))). 

smpl(4, 13, obj(text(1,just,8), p(86,259,117,261))). 



4 Positive and Negative Examples 

In this task, we do not have the negative examples needed for ILP. Thus our case can 
be regarded as a nonmonotonic inductive logic programming setting [De Raedt 93]. 

To generate negative examples, we consider applying the Closed World 
Assumption (CWA). In other words, any layout different from the original one is 
considered a negative example. However, CWA does not always hold in the design 
domain. Thus, in our study we assume a situation in which the last element is to be 
placed on a page on which all other elements have already been placed. In this 
situation, we consider it always wrong to place the last element in any location other than 
its correct (original) location. In this way, we apply CWA and generate many negative examples. 
For instance, in Figure 2, (a) shows a positive example, and (b), (c), and (d) show 
negative examples when the black object is to be placed. 

Fig. 2. Positive and negative examples 

The target predicate for our learning experiments is 
good_locate1(Case, Location), where Case is a term 
case(PageID, ObjectID) and Location is a term loc(X, Y). This says that 
on the page PageID, it is good to place the object ObjectID at location (X,Y). For 
instance, a positive example is represented as 
good_locate1(case(4,1), loc(40,22)). 
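
The generation of negative examples for one case can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the page size of 204 x 280 mm, the integer millimeter grid, the number of trials and the function name random_negatives are all hypothetical. It draws random upper-left locations for the last object, keeps only those for which the object stays inside the page, and discards the original location.

```python
import random


def random_negatives(true_loc, size, page_size, trials, seed=0):
    """Return random locations (x, y), different from true_loc, at which an
    object of the given size still lies completely inside the page."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    width, height = size
    page_width, page_height = page_size
    kept = []
    for _ in range(trials):
        x = rng.randint(0, page_width)   # candidate upper-left corner
        y = rng.randint(0, page_height)
        inside = x + width <= page_width and y + height <= page_height
        if inside and (x, y) != true_loc:
            kept.append((x, y))
    return kept


if __name__ == "__main__":
    # Object 1 of page 4 is 124 mm wide and 9 mm high; its location is (40, 22).
    # The page size below is an assumed value, not taken from the paper.
    negatives = random_negatives((40, 22), (124, 9), (204, 280), trials=50)
    for x, y in negatives[:3]:
        print(f"negative example: good_locate1(case(4,1), loc({x},{y}))")
```

Each kept location corresponds to one negative example for the case; rejecting out-of-page candidates means that fewer examples are kept than trials are made.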



5 Background Knowledge 

To determine the background knowledge for our experiments, we investigated actual 
advertisement pages. First, we tried to detect and describe basic rules about goodness 
of layout from the pages as well as we could, consulting design textbooks. Next, we 
determined the predicates needed to describe such rules and defined the predicates, 
partly referring to [Honda 94]. Our background knowledge consisting of 15 
predicates and their definitions is shown below. 

For an object: 

• form_obj(Case, Location, Obj): From Case and Location, an object Obj is 
formed, referring to smpl. For example, 
form_obj(case(4,1), loc(40,22), obj(text(1,just,36), p(40,22,124,144))) 
holds when smpl is defined as shown in Section 3. 

• obj_type(Obj, Type): The type of Obj is Type. 

• obj_width(Obj, Length): The width of Obj is Length, in millimeters. 

• obj_height(Obj, Length): The height of Obj is Length, in millimeters. 

• h_position(Obj, Pos): The approximate horizontal position of Obj is Pos. 
Pos is left, middle, or right. 

• v_position(Obj, Pos): The approximate vertical position of Obj is Pos. 
Pos is top, middle, or bottom. 

• h_centered(Obj): Obj is horizontally centered. An error of 4 mm or less is 
allowed. 

For text: 

• text_align(Obj, AlignType): Alignment type between lines in Obj is 
AlignType. 

• font_size(Obj, F): The size of font used in Obj is F, in millimeters. 

• column(Obj, N): The number of columns in Obj is N. 

For relationships between objects: 

• h_align(Case, Obj1, Obj2, AlignType): On page PageID (in Case), 
Obj1 and Obj2 are horizontally aligned. AlignType is left, right, 
center, or just. 

• v_align(Case, Obj1, Obj2, AlignType): Obj1 and Obj2 are vertically 
aligned. AlignType is top, bottom, center, or just. Note that for these 
two predicates, an error of 1 mm is allowed in alignment. 

• above(Case, Obj1, Obj2, Distance): Obj1 is located above Obj2 and the 
distance between the two is Distance, in millimeters. 

• direct_above(Case, Obj1, Obj2, Distance): Obj1 is located directly 
(no object exists between the two) above Obj2. 

• overlapping(Case, Obj1, Obj2): Obj1 and Obj2 overlap each other. 
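
Several of these predicates reduce to simple rectangle geometry. The sketch below is one possible reading, not the definitions used in the experiments: the Rect type, the page width argument, the treatment of touching rectangles as non-overlapping, and the interpretation of left/right/center as matching left edges, right edges and horizontal centers are our assumptions. The alignment type just is not modeled.

```python
from typing import NamedTuple, Set


class Rect(NamedTuple):
    """Circumscribing rectangle p(L,U,R,D), coordinates in millimeters."""
    left: float
    up: float
    right: float
    down: float


def obj_width(r: Rect) -> float:
    return r.right - r.left


def obj_height(r: Rect) -> float:
    return r.down - r.up


def overlapping(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share interior area (assumption: rectangles
    that merely touch along an edge do not count as overlapping)."""
    return (a.left < b.right and b.left < a.right
            and a.up < b.down and b.up < a.down)


def h_centered(r: Rect, page_width: float, tol: float = 4.0) -> bool:
    """Horizontally centered on the page, allowing an error of tol mm."""
    return abs((r.left + r.right) / 2 - page_width / 2) <= tol


def h_align(a: Rect, b: Rect, tol: float = 1.0) -> Set[str]:
    """Alignment types that hold between a and b, allowing an error of tol mm."""
    kinds = set()
    if abs(a.left - b.left) <= tol:
        kinds.add("left")
    if abs(a.right - b.right) <= tol:
        kinds.add("right")
    if abs((a.left + a.right) / 2 - (b.left + b.right) / 2) <= tol:
        kinds.add("center")
    return kinds


# Objects 1 to 3 of page 4, copied from the smpl facts in Section 3.
OBJ1 = Rect(40, 22, 164, 31)
OBJ2 = Rect(74, 34, 132, 119)
OBJ3 = Rect(141, 33, 174, 63)

if __name__ == "__main__":
    print(obj_width(OBJ1), obj_height(OBJ1))
    print(overlapping(OBJ1, OBJ2), h_align(OBJ1, OBJ2))
```

Under this reading, objects 1 and 2 of page 4 do not overlap and are center-aligned within the 1 mm tolerance.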

Our page model and our background knowledge used here are unintentionally 
similar to the ones used in document layout recognition problems [Esposito 94]. 
However, there are differences, such as permission of overlapping and features for 
text, because different problems are treated. 



6 Learning Methods and Experimental Results 

In our experiments, we use three different learning methods. 

6.1 Normal ILP 

We sampled 20 advertisement pages from the magazine BYTE and constructed a 
positive example set from all objects on the pages. Among randomly generated 
examples, only those whose corresponding object falls within the page area form the 
negative example set. In all experiments, we used this dataset comprising 228 
positive examples and 1039 negative examples. We used the ILP system Progol 
(CProgol ver4.4) [Muggleton 95]. 

Below, we show the 11 rules generated by Progol when the noise parameter was 
60%. The noise parameter indicates the percentage of the negative examples that 
learned rules are allowed to cover. The running time (on a Sun Enterprise 5000) was 
approximately 29 minutes. 

(1) good_locate1(A, B) :- form_obj(A, B, C), 
        h_align(A, C, D, left). 

(2) good_locate1(A, B) :- form_obj(A, B, C), 
        h_align(A, C, D, right). 

(3) good_locate1(A, B) :- form_obj(A, B, C), 
        h_align(A, C, D, center). 

(4) good_locate1(A, B) :- form_obj(A, B, C), 
        obj_type(C, image). 

(5) good_locate1(A, B) :- form_obj(A, B, C), 
        v_align(A, C, D, center), v_position(C, bottom). 

(6) good_locate1(A, B) :- form_obj(A, B, C), 
        h_position(C, right), v_position(C, bottom). 

(7) good_locate1(A, B) :- form_obj(A, B, C), 
        v_align(A, C, D, center), v_align(A, C, D, top), 
        v_position(C, top). 

(8) good_locate1(A, B) :- form_obj(A, B, C), 
        v_align(A, C, D, top), obj_type(C, illustration), 
        h_position(C, right), obj_type(D, illustration). 

(9) good_locate1(A, B) :- form_obj(A, B, C), 
        v_position(C, bottom), obj_height(C, D), 16=<D, 
        D=<16. 

(10) good_locate1(A, B) :- form_obj(A, B, C), 
        v_position(C, bottom), obj_height(C, D), 28=<D, 
        D=<28. 

(11) good_locate1(A, B) :- form_obj(A, B, C), 
        obj_width(C, D), obj_height(C, E), 14=<E, 168=<D. 

The first rule learned states that B is a good location in case A if the object C 
located at B is left-aligned with another object D. Six of the 11 learned rules contain 
predicates representing relationships between objects. This shows that ILP is suitable for 
acquiring layout knowledge because, among the approaches discussed, only ILP can generate such rules. While most 
rules are general and characterize good layouts, rules (4) and (11) are not useful, since 
they do not say anything about locations. We think rules (6), (9), and (10) 
characterize this particular dataset, and rules (7) and (8) seem to be overly specific. 

We measured predictive accuracy on unseen cases using five-fold 
cross-validation. The twenty sample pages were split into five folds, each of which contains 
four pages. Figure 3 shows the performance results as the noise parameter was 
varied. We do not use conventional accuracy because the number of negative 
examples is arbitrary. Instead, we measured accuracy independently for positive and 
negative examples. In Figure 3, Predictive value(+) = $TP/R^{+}$, where $TP$ is the number 
of correctly predicted positive examples, and $R^{+}$ is the number of positive examples. 
Similarly, Predictive value(-) = $TN/R^{-}$, where $TN$ is the number of correctly predicted 
negative examples, and $R^{-}$ is the number of negative examples. To average, we 
adopted the $F_1$ score [Slattery 98, Rijsbergen 79]. This is defined as 
$F_1 = 2P^{+}P^{-}/(P^{+}+P^{-})$, where $P^{+}$ and $P^{-}$ are predictive values (+) and (-), respectively. In our 
experiment, the $F_1$ score peaked at 77.7% when the noise was 30%. 
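
As a small worked illustration of this measure, the sketch below computes the two predictive values and combines them by their harmonic mean. The function names and the counts in the usage example are hypothetical; only the dataset sizes (228 positive and 1039 negative examples) come from the text.

```python
def predictive_values(tp, n_pos, tn, n_neg):
    """Per-class accuracy: fraction of positive (resp. negative) examples
    that are predicted correctly."""
    return tp / n_pos, tn / n_neg


def f1_score(p_plus, p_minus):
    """Harmonic mean of the two predictive values."""
    if p_plus + p_minus == 0:
        return 0.0
    return 2 * p_plus * p_minus / (p_plus + p_minus)


if __name__ == "__main__":
    # Hypothetical counts: half of the positives and all negatives are correct.
    p_plus, p_minus = predictive_values(tp=114, n_pos=228, tn=1039, n_neg=1039)
    print(p_plus, p_minus, f1_score(p_plus, p_minus))
```

Because the harmonic mean is dominated by the smaller of the two values, a hypothesis that rejects everything (or accepts everything) scores 0 regardless of how many negative examples were generated.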
Fig. 3. Performance of normal learning procedure. F stands for F1 score. 



6.2 Nonmonotonic Learning 

Here we present a repeat learning method based on nonmonotonic learning. This 
method can treat layout exceptions, that is, prohibition rules in design knowledge that are 
naturally described using negation. In our method, the ILP algorithm is executed 
twice, and positive and negative examples are exchanged in the second learning 
session. 

This method introduces a new predicate for exceptions in a way similar to the 
nonmonotonic learning of Closed World Specialization (CWS) [Bain 91, Srinivasan 
92]. Whereas CWS first selects an over-general clause then specializes it with a new 
predicate, our method specializes a whole logic program (a set of clauses). In our 
approach, the ILP algorithm is not modified but is treated as a component. 

The following is our nonmonotonic learning algorithm. In the algorithm, 
ILP(BK,Pos,Neg) outputs a (possibly over-general) logic program by an ILP 
algorithm. 

Input: background knowledge BK, a set of positive examples Pos, and a set of 
negative examples Neg 
H1 = ILP(BK,Pos,Neg) 
Neg2 = subset of Neg covered by H1 
H2 = ILP(BK,Neg2,Pos) 
Output: H = H1 ∧ ¬H2 

In Prolog, we have to rename the predicates in the heads of H1 and H2 individually 
and add a rule using negation as failure like: 

target(V1, ..., Vn) :- target1(V1, ..., Vn), 
    not(target2(V1, ..., Vn)). 

In our experiment, we add the following rule: 

good_locate1(Case, Loc) :- 
    good_locate1_learned1(Case, Loc), 
    not(good_locate1_learned2(Case, Loc)). 
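
The control flow of the algorithm can be sketched independently of any particular ILP system. In the following minimal illustration the learner is passed in as a function; interval_learner is a deliberately crude, hypothetical stand-in for Progol (it works on integers and ignores its negative examples, so it over-generalizes), and the background knowledge BK is assumed to be folded into the learner.

```python
from typing import Callable, Set

Hypothesis = Callable[[int], bool]
Learner = Callable[[Set[int], Set[int]], Hypothesis]


def nonmonotonic_learn(ilp: Learner, pos: Set[int], neg: Set[int]) -> Hypothesis:
    """Two learning sessions: H = H1 and not H2."""
    h1 = ilp(pos, neg)                    # first session, possibly over-general
    neg2 = {e for e in neg if h1(e)}      # negative examples wrongly covered by H1
    h2 = ilp(neg2, pos)                   # second session, roles exchanged
    return lambda e: h1(e) and not h2(e)  # exceptions handled by negation


def interval_learner(pos: Set[int], neg: Set[int]) -> Hypothesis:
    """Toy learner (not an ILP system): covers the smallest interval that
    contains all positive examples and ignores the negative ones."""
    if not pos:
        return lambda e: False
    low, high = min(pos), max(pos)
    return lambda e: low <= e <= high


POS = {2, 3, 7, 8}
NEG = {0, 5, 10}
H = nonmonotonic_learn(interval_learner, POS, NEG)

if __name__ == "__main__":
    for example in sorted(POS | NEG):
        print(example, H(example))
```

In this toy run the first session covers the interval from 2 to 8 and therefore the negative example 5; the second session learns exactly that exception, so the combined hypothesis accepts all positive examples and rejects all negative ones.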

We use Progol as the ILP algorithm. The noise parameter for the first learning 
session is set to 60% so that the rules generated include numerous exceptions, since our 
preliminary experiments showed that a higher initial noise parameter yields a higher 
final F1 score. In our experiment, 464 negative examples remained (i.e., were covered 
by H1) for the second learning session. Note that the method requires many negative examples in the 
beginning, so that enough examples remain. In our case, fortunately, this requirement 
is met by the Closed World Assumption. 

This algorithm generated the following 11 rules when the noise in the second 
learning session was 5%. 

(1) good_locate1(A, B) :- form_obj(A, B, C), 
        obj_height(C, D), font_size(C, D). 

(2) good_locate1(A, B) :- form_obj(A, B, C), 
        obj_type(C, image), overlapping(A, C, D), 
        obj_type(D, illustration). 

(3) good_locate1(A, B) :- form_obj(A, B, C), 
        h_position(C, right), overlapping(A, C, D), 
        h_position(D, right). 

(4) good_locate1(A, B) :- form_obj(A, B, C), 
        v_position(C, top), obj_height(C, D), D=<2.000. 

(5) good_locate1(A, B) :- form_obj(A, B, C), 
        v_position(C, top), overlapping(A, C, D), 
        text_align(D, just). 

(6) good_locate1(A, B) :- form_obj(A, B, C), 
        overlapping(A, C, D), v_align(A, D, E, center), 
        obj_type(D, image). 

(7) good_locate1(A, B) :- form_obj(A, B, C), 
        overlapping(A, C, D), overlapping(A, C, E), 
        overlapping(A, D, E). 

(8) good_locate1(A, B) :- form_obj(A, B, C), 
        v_align(A, C, D, center), obj_type(C, text), 
        overlapping(A, C, E), h_centered(E). 

(9) good_locate1(A, B) :- form_obj(A, B, C), 
        v_position(C, top), overlapping(A, C, D), 
        h_align(A, D, E, left), above(A, D, E, F). 

(10) good_locate1(A, B) :- form_obj(A, B, C), 
        overlapping(A, C, D), overlapping(A, C, E), 
        overlapping(A, D, E), above(A, F, E, G). 

(11) good_locate1(A, B) :- form_obj(A, B, C), 
        overlapping(A, C, D), overlapping(A, C, E), 
        overlapping(A, E, F), above(A, D, E, G). 

We think all rules except rule (1) correctly state prohibited positions in layouts. 
For example, rule (4) says that an object whose height is 2 mm or less should not be 
placed in the top area of a page. Rule (7) says that no three different objects should 
overlap. 

Figure 4 shows the accuracy (of H) using the same five-fold cross-validation, with 
varying noise in the second learning session. The F1 score peaked at 84.8% when the 
noise was 5%, showing that our nonmonotonic algorithm outperforms 
normal learning. 

Fig. 4. Performance of nonmonotonic learning. F stands for F1 score. 



6.3 Another Repeat Learning 

The location of an element satisfies many rules, not just one. Here we introduce another 
repeat learning method, mainly to learn many such rules. This method is similar to the 
one given in the previous section, except that the roles of the examples in the 
second learning session are reversed: Pos is again used as the positive examples and Neg2 
as the negative examples. This method also introduces a new predicate in 
a way similar to that in the concept learning tool CLT [Wrobel 94] of MOBAL. Both 
introduce a new predicate for positive examples of the target concept. However, like CWS, 
CLT specializes a single over-general clause, whereas our method specializes a whole 
logic program. 

The following is our repeat learning algorithm. 

Input: background knowledge BK, a set of positive examples Pos, and a set of 
negative examples Neg 
H1 = ILP(BK,Pos,Neg) 
Neg2 = subset of Neg covered by H1 
H2 = ILP(BK,Pos,Neg2) 
Output: H = H1 ∧ H2 

In Prolog, as in the previous method, we have to rename the predicates in the heads 
of H1 and H2 individually and add a rule like: 

target(V1, ..., Vn) :- target1(V1, ..., Vn), 
    target2(V1, ..., Vn). 

Similarly, we used Progol as the ILP algorithm. The noise parameter for the first 
session is set to 60%. We show a subset of the 11 rules generated when the noise in the 
second learning session was 40%. 

(2) good_locate1(A, B) :- form_obj(A, B, C), 
        h_centered(C). 

(6) good_locate1(A, B) :- form_obj(A, B, C), 
        h_align(A, C, D, center), text_align(C, center), 
        h_align(A, D, E, center). 

(8) good_locate1(A, B) :- form_obj(A, B, C), 
        h_align(A, C, D, center), h_position(C, right), 
        v_position(C, bottom). 

(9) good_locate1(A, B) :- form_obj(A, B, C), 
        v_position(C, bottom), obj_height(C, D), D=<2. 

Most rules are more specific than those generated by the first learning session. 
These rules can be regarded as new knowledge that the first session cannot acquire. 
In terms of accuracy, the F1 score peaked at 76.9%, almost the same value as in the first 
experiment. 



7 Conclusions 

We applied Inductive Logic Programming to the graphic design domain, which is 
novel for ILP. From our experiments using real-world data, we obtained rules 
characterizing good layouts and including relationships between elements. Moreover, 
we presented a new method based on nonmonotonic learning where the ILP algorithm 
is used twice. Our experiments showed that the method can learn more accurate rules 
than normal ILP can, which demonstrates the effectiveness of the method on this dataset. 

The rules obtained here can be directly used for automated layout of "the last 
element." Since rules can be regarded as constraints, a locating problem can be 
regarded as a constraint satisfaction problem. Although such a problem 
can be over-constrained or under-constrained, it is handled well by introducing a 
constraint hierarchy [Borning 87], which we think is suitable for this layout problem. 
In our approach, constraint hierarchies can be easily constructed by assigning different 
strengths to the rules generated by the first and second learning sessions. 

Realizing an automated element locating system is a topic for further study. 
Another topic for future work is improving predictive accuracy. We plan to improve 
or extend the background knowledge, guided by analysis of incorrectly predicted 
examples. Other topics are increasing the number of samples by automated data transformation 
from electronic documents, and experimenting with advertisements from sources other than 
BYTE or with other kinds of documents. We believe our proposed nonmonotonic learning 
approach is applicable to domains other than graphic design, especially to domains in 
which the Closed World Assumption can be used to give many negative examples 
initially. 

Abstract. We consider the task of tagging Slovene words with morphosyntactic 
descriptions (MSDs). MSDs contain not only part-of-speech 
information but also attributes such as gender and case. In the case of 
Slovene there are 2,083 possible MSDs. P-Progol was used to learn 
morphosyntactic disambiguation rules from annotated data (consisting of 
161,314 examples) produced by the MULTEXT-East project. P-Progol 
produced 1,148 rules taking 36 hours. Using simple grammatical 
background knowledge, e.g. looking for case disagreement, P-Progol induced 
4,094 clauses in eight parallel runs. These rules have proved effective 
at detecting and explaining incorrect MSD annotations in an independent 
test set, but have not so far produced a tagger comparable to other 
existing taggers in terms of accuracy. 



1 Introduction 

While tagging has been extensively studied for English and some other Western 
European languages, much less work has been done on Slavic languages. The 
results for English do not necessarily carry over to these languages. The tagsets 
for Slavic languages are typically much larger (over 1000), due to their many 
inflectional features; on the other hand, training corpora tend to be smaller. 

In work related to this [?], a number of taggers were applied to the problem 
of tagging Slovene. Four different taggers were trained and tested on a hand-annotated 
corpus of Slovene, the translation of the novel ‘1984’ by G. Orwell. 
The taggers tested were the HMM tagger [?], Brill’s Rule-based Tagger [?], the 
Maximum Entropy Tagger and the Memory-based Tagger [?]. Accuracies 
on ‘known’ words were mostly a little over 90%, with the Memory-based Tagger 
achieving 93.58%. Known words are those found in a lexicon that accompanies 
the corpus. Our goal here was to see whether ILP (specifically P-Progol) could be 
used to learn rules for tagging, to analyse the rules and to compare empirically 
with these other approaches to tagging. 



S. Dzeroski and P. Flach (Eds.): ILP-99, LNAI 1634, pp. 68-[?], 1999. 
© Springer-Verlag Berlin Heidelberg 1999 



2 Morphosyntactic Descriptions 

The EU-funded MULTEXT-East project [?] developed corpora, lexica and tools 
for six Central and East-European languages. The centrepiece of the corpus is 
the novel “1984” by George Orwell, in the English original and translations [?]. 
The novel has been hand-tagged with disambiguated morphosyntactic descriptions 
(MSDs) and lemmas. The novel is marked up for sentences and tokens; these 
can be either punctuation or words. Each punctuation symbol has its own corpus 
tag (e.g. XMDASH), while the words are marked by their morphosyntactic 
descriptions. 

The syntax and semantics of the MULTEXT-East MSDs are given in the
morphosyntactic specifications of the project. These specifications have been
developed in the formalism and on the basis of specifications for six Western
European languages of the EU MULTEXT project; the MULTEXT project
produced its specifications in cooperation with EAGLES (Expert Advisory Group
on Language Engineering Standards).

The MULTEXT-East morphosyntactic specifications contain, along with
introductory matter:

1. the list of defined categories (parts-of-speech)

2. common tables of attribute-values

3. language-particular tables

Of the MULTEXT-East categories, Slovene uses Noun (N), Verb (V), Adjective
(A), Pronoun (P), Adverb (R), Adposition (S)¹, Conjunction (C), Numeral
(M), Interjection (I), Residual (X)², Abbreviation (Y), and Particle (Q).

The morphosyntactic specifications provide the grammars for the MSDs of 
the MULTEXT-East languages. The greatest worth of these specifications is that 
they provide an attempt at a morphosyntactic encoding standardised across lan- 
guages. In addition to already encompassing seven typologically very different 
languages, the structure of the specifications and of the MSDs is readily exten- 
sible to new languages. 

To give an impression of the information content of the Slovene MSDs and
their distribution, Table 1 gives, for each category, the number of attributes in
the category, the total number of values for all attributes in the category, the
number of different MSDs in the lexicon, and, finally, the number of different
MSDs in the annotated MULTEXT-East Slovene “1984” corpus.

To exemplify the use of the annotation, Fig. 1 gives the MSD-annotated first
sentence of the Slovene translation of the novel: It was a bright cold day in
April, and the clocks were striking thirteen. The Bil/Vcps-sma annotation shows
that “Bil” is a singular masculine past participle in the active voice, and the
jasen/Afpmsnn annotation shows that “jasen” is an adjective which is indefinite,
nominative, singular, masculine, positive and qualificative.
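
As a rough illustration of how an MSD packs attribute values into fixed
positions, the following Python sketch decodes the adjective MSD from the
example above. The attribute order and the value tables are inferred from the
gloss of “Afpmsnn” in the text and cover only the codes mentioned there; they
are illustrative assumptions, not the full MULTEXT-East specification.

```python
# Attribute order for adjective MSDs, inferred from the gloss of "Afpmsnn"
# in the text (an assumption, not the complete MULTEXT-East table).
ADJECTIVE_ATTRIBUTES = ["type", "degree", "gender", "number", "case", "definiteness"]

# Only the value codes glossed in the text are listed; other codes are
# passed through unchanged.
ADJECTIVE_VALUES = {
    "type": {"f": "qualificative"},
    "degree": {"p": "positive"},
    "gender": {"m": "masculine"},
    "number": {"s": "singular"},
    "case": {"n": "nominative"},
    "definiteness": {"n": "indefinite"},
}


def decode_adjective(msd):
    """Map each position of an adjective MSD to an attribute name and value."""
    if not msd.startswith("A"):
        raise ValueError("not an adjective MSD: %r" % msd)
    decoded = {}
    for name, code in zip(ADJECTIVE_ATTRIBUTES, msd[1:]):
        if code == "-":  # assumed marker for "attribute not applicable"
            continue
        decoded[name] = ADJECTIVE_VALUES[name].get(code, code)
    return decoded


print(decode_adjective("Afpmsnn"))
```

Shorter MSDs such as Aopmsn decode the same way; trailing attributes that are
not present are simply left out of the result.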

¹ Adpositions include prepositions and postpositions; Slovene uses only prepositions.
² Residual is a category encompassing unknown (unanalysable) lexical items. It
appears only once in the Slovene lexicon, for 2+2=5; in our experiment we also
used it to mark punctuation.

Table 1. Slovene morphosyntactic distribution

PoS           Att  Val    Lex    Cor
Noun            5   16     99     74
Verb            8   26    128     93
Adjective       7   22    279    169
Pronoun        11   36  1,335    594
Adverb          2    4      3      3
Adposition      3    8      6      6
Conjunction     2    4      3      2
Numeral         7   23    226     80
Interjection    0    0      1      1
Residual        0    0      1      1
Abbreviation    0    0      1      1
Particle        0    0      1      1
All            45  139  2,083  1,025


Bil/Vcps-sma je/Vcip3s--n jasen/Afpmsnn &comma;/XCOMMA
mrzel/Afpmsnn aprilski/Aopmsn dan/Ncmsn in/Ccs
ure/Ncfpn so/Vcip3p--n bile/Vmps-pfa trinajst/Mcnpnl &period;/XPERIOD

Fig. 1. First MSD-annotated sentence of the Slovene translation of Orwell’s “1984”



3 Method 

Following the basic approach taken in earlier work, we used ILP to learn MSD
elimination rules, each of which identifies a set of MSDs that cannot be correct
for a word in a particular context. The context for a word is given as the MSDs
of all words to the left and to the right of the word. Not using the actual
words in the context simplified the learning, and is justified on the grounds
that MSDs (unlike, say, Penn Treebank PoS tags) provide very specific
information about the words.

The MULTEXT-East lexicon provides an ambiguity class for the Slovene words
appearing in the corpus. For a given word, this is the set of possible MSDs for
that word. Elimination rules can then be applied to reduce this ambiguity class
in a particular context, ideally reducing it to a single MSD. Note that each rule
requires a word’s context to be sufficiently disambiguated so that it can fire.
This motivates using elimination rules in tandem with another tagger.
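
The way elimination rules shrink an ambiguity class can be sketched in Python
as follows. This is a minimal sketch under stated assumptions: rules are
modelled as functions of (reversed left context, candidate MSD, right
context), and the two toy rules are hypothetical, hand-written placeholders
for illustration, not rules induced in this work.

```python
def reduce_ambiguity_class(left_reversed, ambiguity_class, right, rules):
    """Return the candidate MSDs that no elimination rule removes in this context."""
    return {
        msd
        for msd in ambiguity_class
        if not any(rule(left_reversed, msd, right) for rule in rules)
    }


# Two hypothetical toy rules, written by hand for illustration only
# (they are NOT rules induced in the paper).
def toy_rule_pronoun(left_reversed, focus, right):
    # Eliminate a pronoun reading when an adjective MSD follows immediately.
    return focus.startswith("p") and bool(right) and right[0].startswith("a")


def toy_rule_main_verb(left_reversed, focus, right):
    # Eliminate a main-verb reading directly after the MSD vcps_sma.
    return (
        focus.startswith("vm")
        and bool(left_reversed)
        and left_reversed[0] == "vcps_sma"
    )


candidates = {"pp3fsg__y_n", "vmip3s__n", "vcip3s__n"}
remaining = reduce_ambiguity_class(
    ["vcps_sma"],
    candidates,
    ["afpmsnn", "xcomma"],
    [toy_rule_pronoun, toy_rule_main_verb],
)
print(remaining)
```

If the surviving set has more than one member, or the context itself is still
ambiguous, another tagger is needed to make the final choice, which is the
point made in the paragraph above.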



3.1 Examples 

Each ambiguous word generated a single negative example and one or more
positive examples. Each negative example is represented as a triple of left
context, correct MSD and right context. The correct MSD generates a negative
example for the induction of elimination rules since it identifies an MSD which
it would be incorrect to eliminate. Positive examples contain MSDs which are
incorrect (positive examples of MSDs to eliminate) but which are in the focus
word’s ambiguity class.

The left context is reversed so that the MSD immediately to the left of
the focus is at the head of the list. Figure 2 shows two positive and one
negative example generated from a single occurrence of a word with ambiguity
class {pp3fsg__y_n, vmip3s__n, vcip3s__n}. In this context, vcip3s__n is the
correct MSD and hence appears in a negative example of elimination. MSDs
are represented as constants for efficiency. We used exactly the same training
data that had been used for training the other taggers in the related work.
This data, together with other resources to reproduce our experiments, can be
found at: http://alibaba.ijs.si/et/project/LLL/tag/. We produced 99,261 positive
and 81,805 negative examples, giving a total of 181,066 examples.



%rmv(LeftReversed,Focus,Right)

%2 POSITIVE EXAMPLES
rmv([vcps_sma],[pp3fsg__y_n],[afpmsnn,xcomma,afpmsnn,aopmsn,ncmsn,ccs,
    ncfpn,vcip3p__n,vmps_pfa,mcnpnl,xperiod]).
rmv([vcps_sma],[vmip3s__n],[afpmsnn,xcomma,afpmsnn,aopmsn,ncmsn,ccs,
    ncfpn,vcip3p__n,vmps_pfa,mcnpnl,xperiod]).

%1 NEGATIVE EXAMPLE
rmv([vcps_sma],[vcip3s__n],[afpmsnn,xcomma,afpmsnn,aopmsn,ncmsn,ccs,
    ncfpn,vcip3p__n,vmps_pfa,mcnpnl,xperiod]).

Fig. 2. Examples created from a single occurrence of an ambiguous word 
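
The construction of examples from one ambiguous word can be sketched in Python
as follows. The function name and the tuple layout are assumptions made for
illustration; the layout mirrors rmv(LeftReversed,Focus,Right), with the focus
kept as a one-element list.

```python
def make_examples(msds, focus_index, ambiguity_class):
    """Build elimination examples for one ambiguous word.

    msds            -- the correct MSD of every token in the sentence
    focus_index     -- position of the ambiguous word
    ambiguity_class -- all MSDs the lexicon allows for that word
    Returns (positives, negative) as (left_reversed, [msd], right) triples.
    """
    correct = msds[focus_index]
    if correct not in ambiguity_class:
        raise ValueError("correct MSD must belong to the ambiguity class")
    # Reverse the left context so the nearest neighbour is at the head.
    left_reversed = list(reversed(msds[:focus_index]))
    right = list(msds[focus_index + 1:])
    # The correct MSD must not be eliminated: one negative example.
    negative = (left_reversed, [correct], right)
    # Every other MSD in the class should be eliminated: positive examples.
    positives = [
        (left_reversed, [msd], right)
        for msd in sorted(ambiguity_class)
        if msd != correct
    ]
    return positives, negative


sentence = [
    "vcps_sma", "vcip3s__n", "afpmsnn", "xcomma", "afpmsnn", "aopmsn",
    "ncmsn", "ccs", "ncfpn", "vcip3p__n", "vmps_pfa", "mcnpnl", "xperiod",
]
positives, negative = make_examples(
    sentence, 1, {"pp3fsg__y_n", "vmip3s__n", "vcip3s__n"}
)
print(len(positives), "positive examples, 1 negative example")
```

Applied to the second word of the sentence in Fig. 1, the sketch yields the two
positive examples and the single negative example of Fig. 2.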



3.2 Background Knowledge 

The use of ILP for tagging is particularly well motivated when the tags (here
MSDs) have considerable structure. The background knowledge was designed
to take advantage of that structure. Figure 3 shows some of the background
predicates used.

Working through Figure 3, we have firstly msd/2, which explodes MSDs from
constants into lists so that other predicates can extract the relevant structure
from the MSDs; there were 1703 such msd/2 facts in our background knowledge.
Many of the background predicates consume an initial portion of a word’s context
(left or right) and return the remainder of the context as an output in the last
argument. For example, noun(A,B) is true if A begins with a noun and is followed
by B. We have the predicates gender/3, case/3 and numb/3, which identify
gender, case and number or fail for MSDs where these are not defined. Figure 3
shows two of the gender/3 clauses, which show that the gender identifier is the
3rd attribute for noun MSDs and the 4th for pronoun MSDs. The most important
predicates are disoncase/2, disongender/2 and disonnumb/2, which indicate
when two MSDs disagree in case, gender or number.

J. Cussens, S. Dzeroski, and T. Erjavec

We also have simple phrasal definitions. Noun phrases np/2 are defined as
zero or more adjectives followed by one or more nouns. This is clearly not a full
definition of noun phrase, but is included on the grounds that the simple noun
phrases so defined will be useful features for the elimination rules. Finally, we
have isa/3, which identifies particular MSDs, and skip_over/2, which is used
to skip over apparently unimportant tokens which do not have case, number or
gender defined.

%EXPLODING MSD CONSTANTS
msd(afcfda,[a,f,c,f,d,a]).
msd(afcfdg,[a,f,c,f,d,g]).
msd(afcfdl,[a,f,c,f,d,l]).

%Parts of speech, always first letter
noun([M|T],T) :- msd(M,[n|_]).
verb([M|T],T) :- msd(M,[v|_]).

%GENDER
gender([M|T],Gender,T) :- msd(M,[n,_,Gender|_]).
gender([M|T],Gender,T) :- msd(M,[p,_,_,Gender|_]).

%DISAGREEMENT ON CASE, GENDER OR NUMBER
disoncase(M1,M2) :- case(M1,C1,_), case(M2,C2,_), \+ C1 = C2.
disongender(M1,M2) :- gender(M1,C1,_), gender(M2,C2,_), \+ C1 = C2.
disonnumb(M1,M2) :- numb(M1,C1,_), numb(M2,C2,_), \+ C1 = C2.

%NOUN PHRASE
np(A,B) :- adjective_star(A,C), noun_plus(C,B).

%and backwards..
npl(A,B) :- noun_plus(A,C), adjective_star(C,B).

noun_plus(A,B) :- noun(A,B).
noun_plus(A,B) :- noun(A,C), noun_plus(C,B).

%IDENTIFYING PARTICULAR MSDS
isa([H|T],H,T).

%FOR SKIPPING TO IMPORTANT WORDS
skip_over(A,B) :-
    all_undefined_plus(A,B),
    some_defined(B).



Fig. 3. Excerpt of background knowledge 
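
For readers less familiar with Prolog, the gender and disagreement predicates
can be mirrored in Python roughly as below. This is a sketch under stated
assumptions: contexts are lists of MSD strings, only the noun and pronoun
positions given in the excerpt are covered, and “-” or “_” is taken to mean
that an attribute is undefined.

```python
# Position of the gender code inside the MSD string, per category letter.
# These follow the two gender clauses of the excerpt: 3rd element for nouns,
# 4th element for pronouns (zero-based indices 2 and 3).
GENDER_INDEX = {"n": 2, "p": 3}


def gender(context):
    """Return (gender, rest_of_context) for the first MSD, or None if undefined."""
    if not context:
        return None
    msd = context[0]
    idx = GENDER_INDEX.get(msd[0])
    if idx is None or idx >= len(msd) or msd[idx] in ("-", "_"):
        return None
    return msd[idx], context[1:]


def disongender(context1, context2):
    """True when both leading MSDs define gender and the values differ."""
    g1 = gender(context1)
    g2 = gender(context2)
    return g1 is not None and g2 is not None and g1[0] != g2[0]


print(gender(["ncmsn", "ccs"]))
print(disongender(["ncmsn"], ["ncfpn"]))
```

As in the Prolog version, a disagreement is only reported when gender is
defined on both sides; an MSD without gender makes the test fail rather than
succeed.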

3.3 Splitting the Data

P-Progol is currently unable to accept 181,066 (fairly complex) examples directly.
The data was therefore split according to the part-of-speech of the focus MSD
(the 2nd argument of the examples). This formed the 8 data sets for Noun (n),
Verb (v), Adjective (a), Pronoun (p), Adverb (r), Adposition (s), Numeral (m)
and Other (o) described in Table 2. The “Other” dataset covered conjunctions
(c) and particles (q) together. Although large by ILP standards, each of these
datasets was sufficiently small to be processed by P-Progol 2.4.7 running on
Yap Prolog 4.1.15.

Although motivated by pragmatics, this splitting had a number of beneficial
effects. The split meant that all eight datasets could have been processed in
parallel as eight separate Yap processes. In fact, due to a lack of suitable machines
the work was spread between the first author’s Viglen laptop (233 MHz Pentium,
80 MBytes RAM) and Steve Moyle’s PC (266 MHz, 128 MBytes RAM). These
machines are denoted Y and O respectively in Table 2. Since we had 8 rule
sets induced for specific parts-of-speech, we were able to index on the
part-of-speech by altering the induced rules to have the relevant part-of-speech
as a first argument.
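
The split by part-of-speech can be sketched in Python as follows. The function
name and the tuple layout of an example are assumptions for illustration; the
only facts taken from the text are that the split key is the first letter of
the focus MSD and that conjunctions (c) and particles (q) share the “Other”
data set (o).

```python
from collections import defaultdict

# Conjunctions and particles are pooled into the "Other" data set.
OTHER = {"c", "q"}


def split_by_pos(examples):
    """Group (left_reversed, [focus_msd], right) examples by focus part-of-speech."""
    datasets = defaultdict(list)
    for left_reversed, focus, right in examples:
        pos = focus[0][0]  # first letter of the focus MSD
        key = "o" if pos in OTHER else pos
        datasets[key].append((left_reversed, focus, right))
    return dict(datasets)


examples = [
    (["vcps_sma"], ["vcip3s__n"], ["afpmsnn"]),
    ([], ["ccs"], ["ncfpn"]),
    ([], ["q"], []),
    (["aopmsn"], ["ncmsn"], ["ccs"]),
]
datasets = split_by_pos(examples)
print(sorted(datasets))
```

Because each resulting data set is independent of the others, the learner can
be run on them separately, and in principle in parallel.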

In effect, we performed a single initial greedy split of the data as would be
done as the first step in a decision tree inducer such as TILDE. Since many
of the clauses induced in earlier work on random samples of the complete data
set were specific to a particular part-of-speech (e.g. rmv(L,F,R) :- noun(L,L2),
...), we will not have missed many good clauses as a result of our greediness.

3.4 P-Progol Parameters and Constraints 

As well as limiting the amount of data input to a particular P-Progol run, we also 
constrained the Progol search in two major ways. The basic Progol algorithm 
consists of taking a ‘seed’ uncovered positive example, producing a most specific 
‘bottom clause’ which covers it and then using the bottom clause to guide the 
search for the ‘best’ clause that covers the seed. 

P-Progol has a number of built-in cost functions: the ‘best’ clause is that
which minimises this cost. In this work, we chose m-estimation (p. 179-180 of
the cited reference) to estimate the accuracy of clauses, and searched for the
clause that maximised estimated accuracy. An m value of 1 was chosen. Such a
low value of m might allow overfitting, so as a guard against this, only clauses
which covered at least 10 positives were allowed. Such a stopping rule has the
advantage of allowing the search to be pruned. If a clause dips below 10
positives then there is no point considering any specialisations of that clause,
since they will also cover fewer than 10 positives. Also, only clauses with at
least 97% training accuracy were allowed.
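
The scoring and the two acceptance thresholds can be sketched in Python as
follows. The m-estimate of accuracy for a clause covering p positives and n
negatives is (p + m * p0) / (p + n + m). The choice of the prior p0 is an
assumption here: taking it to be the proportion of positive examples in the
data set reproduces the clause scores reported in Section 4.2 (for instance,
with the 29,099 positives among the 34,808 adjective examples, a clause
covering 1283 positives and 1 negative scores 0.999094). The function names
are illustrative and not part of P-Progol.

```python
def m_estimate(pos, neg, prior, m=1.0):
    """m-estimate of clause accuracy: (pos + m * prior) / (pos + neg + m)."""
    return (pos + m * prior) / (pos + neg + m)


def acceptable(pos, neg, min_pos=10, min_acc=0.97):
    """Apply the two thresholds from the text: positive cover and training accuracy."""
    if pos < min_pos:
        # Specialisations can only cover fewer positives, so the search
        # below such a clause can be pruned.
        return False
    return pos / (pos + neg) >= min_acc


# Assumed prior: proportion of positives in the adjective data set.
adjective_prior = 29099 / 34808
print(round(m_estimate(1283, 1, adjective_prior), 6))
print(acceptable(1283, 1), acceptable(9, 0))
```

With m = 1 the prior has little influence on well-supported clauses, which is
why the minimum cover of 10 positives is needed as a separate guard.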

Two more constraints were required for learning to be feasible. Firstly, we
restricted each Progol search to a maximum of 5000 clauses; many searches hit
this threshold. Secondly, we limited clauses to a maximum of four literals, the
only exception being the Numeral (m) run because of its small example set.
Caching was only used on two small runs to avoid any risk of running out
of RAM.

4 Results

4.1 The Induction Process 

Considerable effort was expended in tracking down bugs in earlier versions of
Yap Prolog which involved indexing problems for large data sets. Ashwin
Srinivasan and the first author also implemented improvements to P-Progol which
considerably improved its efficiency. Despite these (productive) efforts, the large
number of examples and the nature of the Progol search meant long P-Progol
runs, sometimes lasting a few days (see Table 2).



Table 2. Data set and induction statistics

PoS    Pos    Neg     Tot  Rules  Time (hrs)  Searches  Machine  Caching  Max lits
a    29099   5709   34808   1148        36.1      1513  Y        on       4
m     1972    843    2815    223         2.6       305  Y        off      5
n    20125  14569   34694    809        54.0      2767  O        off      4
o     2279  21946   24225     36        15.8      1960  Y        on       4
p    26585   8480   35065   1291        54.5      2750  O        off      4
r     2500   4980    7480     42         5.3      1941  O        off      4
s     4865   6025   10890     90         2.3       293  Y        off      4
v    11836  19253   31089    455        25.6      1599  O        off      4
All  99261  81805  181066   4094       196.2     13128


4.2 Structure of Induced Theories 

P-Progol associates a clause label with each induced clause which gives the
positive and negative cover, and the clause’s ‘score’. In our case the score was
clause accuracy as estimated by the m-estimate. This allowed us to parameterise
the induced theory, converting a clause such as

rmv(L,F,R) :- case(R,n,R2), disonnumb(F,R2), disonnumb(F,R).

from the adjective theory, to:

rmv(a,L,F,R,score(1283,1,0.999094)) :-
    case(R,n,R2), disonnumb(F,R2), disonnumb(F,R).

This allowed us to produce subsets of the complete theory by thresholding on
the m-estimated accuracy (EstAcc). For example, filtering out all rules with
EstAcc < 0.999 results in an under-general theory but with only ultra-reliable
rules remaining. Note also the added “a” index which indicates that the rule
only applies when F (the focus word) is an adjective.
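
Thresholding the parameterised theory amounts to a simple filter, sketched
here in Python. The Rule record and the function name are assumptions for
illustration; the scores are copied from clauses shown in Fig. 4, with the
rule bodies omitted.

```python
from collections import namedtuple

# One entry per induced rule: part-of-speech index, positive cover,
# negative cover and m-estimated accuracy (the score/3 term of a clause).
Rule = namedtuple("Rule", ["pos", "pos_cover", "neg_cover", "est_acc"])

# Scores copied from the clauses of Fig. 4; rule bodies are omitted here.
RULES = [
    Rule("a", 1130, 0, 0.999855),
    Rule("a", 748, 0, 0.999781),
    Rule("v", 619, 0, 0.999001),
    Rule("n", 600, 0, 0.999301),
    Rule("n", 433, 0, 0.999032),
    Rule("m", 14, 0, 0.980036),
]


def subtheory(rules, min_est_acc):
    """Keep only the rules whose estimated accuracy reaches the threshold."""
    return [rule for rule in rules if rule.est_acc >= min_est_acc]


print(len(subtheory(RULES, 0.999)))
```

Raising the threshold trades coverage for reliability: fewer rules survive,
but those that do are less likely to reject a correct annotation.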

The isa/3 predicate appears very often in the induced theory, indicating the
importance of MSD-specific rules. Also, as expected, many of the rules look for
disagreement between neighbouring words. In the EstAcc > 0.999 subtheory,
exactly half of the 240 rules used the disagreement predicates. Many of those
that did not used two literals to specify particular disagreements.

In the complete theory, 2069 of the 4094 rules used only features of chunks
(words or simple phrases) right next to the focus word. In the EstAcc > 0.999
subtheory, 173 out of the 240 rules used only such features, showing that most
highly reliable rules are quite simple, identifying anomalies between neighbouring
words. The remaining 67 rules with EstAcc > 0.999 all looked only
one chunk beyond neighbouring chunks.



rmv(a,L,F,R,score(1130,0,0.999855)) :-
    case(F,n,D), numb(F,d,D), disonnumb(F,R).
rmv(a,L,F,R,score(748,0,0.999781)) :-
    case(R,l,R2), gender(R,f,R2), gender(F,m,_).
rmv(v,L,F,R,score(619,0,0.999001)) :-
    numb(F,p,D), gender(F,n,D), isa(L,vcip3s__n,L2).
rmv(n,L,F,R,score(600,0,0.999301)) :-
    numb(F,d,_), isa(L,spsl,L2).
rmv(n,L,F,R,score(433,0,0.999032)) :-
    case(F,a,_), isa(L,spsg,L2).
rmv(m,L,F,R,score(14,0,0.980036)) :-
    gender(F,f,_), isa(L,spsi,L2),
    numb(L2,_,L3), case(L3,_,L4).

Fig. 4. A subset of the induced disambiguation theory 



4.3 Consistency Checking and Ambiguity Reduction 

Here we tested the consistency of each test sentence with the rules. As Table 3
shows, a good half of the correct readings are deemed inadmissible by the
complete theory. This is because it only takes one disambiguation rule to
incorrectly fire for a whole sentence to be rejected.



Table 3. Proportion of test sentence annotations rejected

EstAcc > (%)     0  96.0  97.0  98.0  98.5  99.0  99.5  99.7  99.9
Rejected (%)  49.5  48.6  46.4  36.9  32.4  24.7  15.8  10.9   3.5



To measure ambiguity reduction, we selected those 263 sentences from the
test set which had fewer than 2000 possible annotations according to the
ambiguity classes of the words in the sentences. Many sentences have millions of
possible annotations, most of them plainly absurd, so we do not wish to
receive credit for eliminating these. Table 4 shows the ambiguity reduction
factor (ARF) for each subtheory: we summed the number of possible annotations
for each sentence in the test set, giving a total of 81634 annotations. To get
the ARF we divided this number by the number of annotations consistent with the
rules. We also give the rejection error rate (RER), the percentage of times that
the annotation given in the test set was inconsistent with the rules.
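
The two measures can be sketched in Python as follows. This is a minimal
sketch under stated assumptions: each sentence is given as a list of ambiguity
classes plus its test set annotation, is_consistent is a hypothetical callback
that applies the rules to a complete annotation, and annotations are
enumerated exhaustively, which is only practical for sentences with a modest
number of possible annotations.

```python
from itertools import product
from math import prod


def count_annotations(ambiguity_classes):
    """Number of possible annotations: product of the ambiguity class sizes."""
    return prod(len(cls) for cls in ambiguity_classes)


def ambiguity_stats(sentences, is_consistent):
    """Return (ARF, RER in percent) over (ambiguity_classes, gold) pairs."""
    total = 0          # all possible annotations
    surviving = 0      # annotations consistent with the rules
    rejected_gold = 0  # test set annotations the rules reject
    for classes, gold in sentences:
        total += count_annotations(classes)
        surviving += sum(
            1 for annotation in product(*classes) if is_consistent(list(annotation))
        )
        if not is_consistent(list(gold)):
            rejected_gold += 1
    arf = total / surviving if surviving else float("inf")
    rer = 100.0 * rejected_gold / len(sentences)
    return arf, rer


# Toy data with a hypothetical consistency check, for illustration only.
toy_sentences = [([["a", "b"], ["c", "d"]], ["a", "c"])]
print(ambiguity_stats(toy_sentences, lambda annotation: annotation[0] != "b"))
```

In the toy call, two of the four possible annotations survive and the test set
annotation is kept, giving an ARF of 2.0 and an RER of 0.0.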



Table 4. Per sentence ambiguity reduction factor and rejection rate, for sentences with
fewer than 2000 possible annotations

EstAcc > (%)     0  96.0  97.0  98.0  98.5  99.0  99.5  99.7  99.9
ARF           67.3  65.1  56.6  38.0  29.4  21.1  10.6   6.9   2.7
RER (%)       25.5  24.7  22.1  17.5  13.3   8.7   3.0   2.3   0.4


The ambiguity reduction factor is good: even the EstAcc > 0.999 theory
reduces sentence ambiguity by a factor of nearly three. However, to use the
rules to reduce ambiguity, we should be almost guaranteed not to reject the
correct annotation; this means only the small theories composed only of highly
reliable rules should be used.

4.4 Error Detection 

The 0.4% RER for the EstAcc > 0.999 theory was due to a single test set
annotation being rejected. The annotated test sentence was: “[Winston/npmsn]
[in/ccs] [Julia/npfsn] [sta/vcip3d__n] [se/px y] [ocarana/afpmdn]
[objela/vmps_sfa] [./xperiod]” (“Delighted, Winston and Julia embraced.”)
This annotation was rejected by this rule:

rmv(a,L,F,R,score(1130,0,0.999855)) :-
    case(F,n,D), numb(F,d,D), disonnumb(F,R).

on the grounds that dual nominative adjectives can not be followed by any
word that does not have the same number. This rules out the “ocarana/afpmdn,
objela/vmps_sfa” annotation.
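
How this rule fires on the rejected annotation can be traced with a small
Python sketch. The position tables are inferred from MSDs glossed earlier in
the paper (for example Ncmsn, Afpmsnn and Vcps-sma) and should be read as
illustrative assumptions; the focus is passed as a plain string here rather
than as a one-element list.

```python
# Positions of number and case inside an MSD string, per category letter.
# Inferred from MSDs glossed in the paper; illustrative assumptions only.
NUMBER_INDEX = {"n": 3, "a": 4, "p": 4, "v": 5}
CASE_INDEX = {"n": 4, "a": 5, "p": 5}


def attribute(msd, index_table):
    """Return the attribute value of one MSD, or None if it is undefined."""
    idx = index_table.get(msd[0])
    if idx is None or idx >= len(msd) or msd[idx] in ("-", "_"):
        return None
    return msd[idx]


def dual_nominative_adjective_rule(left_reversed, focus, right):
    """Python reading of: case(F,n,D), numb(F,d,D), disonnumb(F,R)."""
    if not focus.startswith("a"):
        return False  # the rule is indexed on adjectives
    if attribute(focus, CASE_INDEX) != "n":
        return False
    if attribute(focus, NUMBER_INDEX) != "d":
        return False
    if not right:
        return False
    # disonnumb only succeeds when number is defined on both sides.
    right_number = attribute(right[0], NUMBER_INDEX)
    return right_number is not None and right_number != "d"


# ocarana/afpmdn followed by objela/vmps_sfa and the full stop.
print(dual_nominative_adjective_rule([], "afpmdn", ["vmps_sfa", "xperiod"]))  # prints True
```

The rule fires because the adjective is dual while the following verb form is
tagged as singular; a right neighbour without a number attribute, such as
punctuation, would not trigger it.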

Upon inspection we found that the rule was correct to reject this annotation:
“objela” (embraced) should have been tagged as dual, not singular. This led
us to use the EstAcc > 0.999 theory to look for other possible errors in the
complete test set. To help us do this we wrote a simple Prolog interface which
flagged possible errors and explained why they were suspected errors. Figure 5
shows how the interface flagged the “objela” error.

This demonstrates two points. Firstly, our disambiguation rules can be used
to detect incorrect annotations, and provide an explanation of why the
annotation is incorrect. Secondly, our rules are constraints that apply not only
to the focus word. Here, “ocarana”, the focus word, was annotated correctly; the
inconsistency was detected in its context.

**ERROR DETECTED**

[(Winston_BOS,npmsn),(in,ccs),(Julija,npfsn),(sta,vcip3d__n),
 (se,px y)]

ocxarana,afpmdn <= HERE
[(objela,vmps_sfa),(.,xperiod)]

Constraint number 1 confidence score(1130,0,0.999855)
because:

ocxarana,afpmdn is ambiguous, and is (apparently) an adjective

and we can not also have:
ocxarana,afpmdn with case: n
ocxarana,afpmdn with numb: d

objela,vmps_sfa and ocxarana,afpmdn disagreeing on number
Enter y if this is a real error



Fig. 5. Interface for annotation error flagging

However, of the 24 alleged test set errors flagged by our constraints, only
9 turned out to be actual errors. The other 15 were examples of rare atypical
constructions. All the constraints which incorrectly flagged errors had EstAcc >
0.999. For example, one had covered 761 positives and 0 negatives in the training
data. So the large number of false alarms is perhaps surprising and points to
possible over-fitting with an insufficiently large training dataset. On the
other hand, as an annotation validation tool, performance is reasonable, and it
would be easy to expand the explanations and allow the user to correct (real)
errors as they are presented.

4.5 Tagging Accuracy 

Our MSD elimination rules can not be used as a standalone tagger: they rely
too heavily on disambiguated context and there is no guarantee that a single
MSD will be returned for each word after incorrect MSDs have been eliminated.
We propose that they can be used as a filter to reject inconsistent annotations
produced by another tagger, such as those mentioned in Section 1.

Here we combine our rules with the simplest tagger, one that returns the
most likely tag based on lexical statistics, without taking context into account.
Our goal is to measure the degree to which accuracy increases once the rules are
used to filter out incorrect annotations.

Due to problems with Sicstus Prolog, experiments were conducted on a subset
of 526 of the original 650 test sentences. These were sentences with fewer than
2000 possible annotations. Choosing the most likely tag according to lexical
statistics produces an accuracy of 83.3% on this test set. We combined this
tagger with our rules by doing a uniform cost search (i.e. A* with h = 0) for the
most probable sentence annotation according to the lexical statistics which was
consistent with our rules.
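
A uniform cost search of this kind can be sketched in Python as follows. This
is a minimal sketch under stated assumptions, not the authors’ implementation:
lexical statistics are given as one dictionary per word mapping each candidate
MSD to its probability, the cost of a partial annotation is the sum of negative
log probabilities, and is_consistent is a hypothetical callback that applies
the rules to a complete annotation only.

```python
import heapq
import math


def best_consistent_annotation(lexical_probs, is_consistent):
    """Uniform cost search for the most probable rule-consistent annotation.

    lexical_probs -- one dict per word: candidate MSD -> P(MSD | word)
    is_consistent -- callback applied to a complete annotation (list of MSDs)
    Returns (annotation, probability), or None if nothing is consistent.
    """
    n = len(lexical_probs)
    frontier = [(0.0, ())]  # (cost so far, partial annotation)
    while frontier:
        cost, partial = heapq.heappop(frontier)
        if len(partial) == n:
            # Complete annotations are popped in order of increasing cost,
            # so the first consistent one is the most probable.
            if is_consistent(list(partial)):
                return list(partial), math.exp(-cost)
            continue
        for msd, prob in lexical_probs[len(partial)].items():
            if prob > 0.0:
                heapq.heappush(frontier, (cost - math.log(prob), partial + (msd,)))
    return None


# Toy statistics and a hypothetical consistency check, for illustration only.
toy_probs = [{"x": 0.6, "y": 0.4}, {"u": 0.7, "v": 0.3}]
print(best_consistent_annotation(toy_probs, lambda annotation: annotation != ["x", "u"]))
```

In the toy call the most likely annotation is rejected, so the search returns
the next most probable one. Checking consistency only on complete annotations
keeps the sketch short; it also means the search may enumerate many candidates
when a sentence has a large number of possible annotations.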

Using the complete theory we achieved 86.6%. Some subtheories did a little
better; for example, the EstAcc > 0.985 accuracy was 87.5%. So we have an
improvement, albeit a modest one. Our experiments with error flagging reported
in Section 4.4 indicate that a major barrier to improved performance is that our
constraints frequently reject correct annotations.

It remains to be seen what improvement, if any, can be achieved when marrying
our rules to more sophisticated taggers such as those mentioned in Section 1.
Clearly the combination examined here has far lower performance than the
taggers mentioned in Section 1.

5 Conclusions and Future Work 

In this work, we have established the following positive results: 

1. P-Progol can be applied directly to datasets of at least 30,000 examples.
With appropriate use of sampling, it is likely that this upper limit
could be increased considerably.

2. We have induced MSD elimination rules which can be used to filter out
incorrect annotations. The symbolic nature of the rules means that an explanation
is also supplied. This makes using these rules particularly appropriate for an
interactive system; we intend to use the rules induced here to check the
existing MULTEXT-East corpus.

We have also established the following negative result: 

1. The performance of the MSD elimination rules as a standalone system or in 
tandem with a crude tagger based on lexical statistics is considerably worse 
than that of competing taggers. 

Apart from checking the MULTEXT-East corpus with the rules, we also intend
to use the rules to check the annotations proposed by the taggers mentioned
in Section 1. By filtering out at least some incorrect annotations, the tagging
accuracy should increase.

Abstract. We present a novel application of inductive logic programming
(ILP) in the area of quantitative structure-activity relationships
(QSARs). The activity we want to predict is the biodegradability of
chemical compounds in water. In particular, the target variable is the
half-life in water for aerobic aqueous biodegradation. Structural
descriptions of chemicals in terms of atoms and bonds are derived from the
chemicals’ SMILES encodings. Definitions of substructures are used as
background knowledge. Predicting biodegradability is essentially a
regression problem, but we also consider a discretized version of the target
variable. We thus employ a number of relational classification and
regression methods on the relational representation and compare these to
propositional methods applied to different propositionalisations of the
problem. Some expert comments on the induced theories are also given.



1 Introduction 

The persistence of chemicals in the environment (or to environmental influences)
is welcome only until the time the chemicals fulfill their role. After that time or
if they happen to be at the wrong place, the chemicals are considered pollutants.
In this phase of the chemicals’ life-span we wish them to disappear as soon
as possible. The most ecologically acceptable (and a very cost-effective) way of
‘disappearing’ is degradation to components which are not considered pollutants
(e.g. mineralization of organic compounds). Degradation in the environment can
take several forms, from physical pathways (erosion, photolysis, etc.), through
chemical pathways (hydrolysis, oxidation, diverse chemolyses, etc.) to biological
pathways (biolysis). Usually the pathways are combined and interrelated, thus
making degradation even more complex. In our study we focus on biodegradation
in an aqueous environment under aerobic conditions, which affects the quality
of surface- and groundwater.
The problem of properly assessing the time needed for ultimate biodegradation
can be simplified to the problem of determining the half-life time of that
process. However, very few measured data exist and even these data are not
taken under controlled conditions. It follows that an objective and comprehensive
database on biolysis half-life times can not be found easily. The best we were able
to find was in a handbook of degradation rates. The chemicals described in
this handbook were used as the basis of our study.

Usually, authors try to construct a QSAR model/formula for only one class 
of chemicals, or congeners of one chemical, e.g. phenols. This approach to QSAR 
model construction has an implicit advantage that only the variation with respect 
to the class mainstream should be identified and properly modelled. Contrary 
to the described situation, our database comprises several families of chemicals, 
e.g. alcohols, phenols, pesticides, chlorinated aliphatic and aromatic hydrocar- 
bons, acids, diverse other aromatic compounds, etc. From this point of view, the 
construction of adequate QSAR models/formulae is a much more difficult task. 

We apply several machine learning methods, including several inductive logic 
programming methods, to the above database in order to construct SAR/QSAR 
models for biodegradability. The remainder of the paper is organized as follows. 
Section 2 describes the dataset and how the representations used by the different 
machine learning systems were generated. Section 3 lists the representation and 
the machine learning systems employed, and describes the experimental setup. 
Section 4 presents the experimental results, including expert comments on some 
of the induced rules. Section 5 gives further discussion. Section 6 comments on 
related work, and Section 7 concludes and gives some directions for further work. 

2 The Dataset 

The database used was derived from the data in the handbook of degradation
rates. The authors have compiled from available literature the degradation
rates for 342 widely used (commercial) chemicals. Where no measured data on
degradation rates were available, expert estimation was performed. The main
source of data employed was the Syracuse Research Corporation’s (SRC)
Environmental Fate Data Bases (EFDB), which in turn used as primary sources of
information DATALOG, CHEMFATE, BIOLOG, and BIODEG files to search
for pertinent data.

For each considered chemical the book contains degradation rates in the form
of a range of half-life times (low and high estimate) for overall, biotic and abiotic
degradation in four environmental compartments, i.e., soil, air, surface water and
ground water. We focus on surface water here. The overall degradation half-life
is a combination of several (potentially) present pathways, e.g., surface water
photolysis, photooxidation, hydrolysis and biolysis (biodegradation). These can
occur simultaneously and even have synergistic effects, resulting in a half-life
time (HLT) smaller than the HLT for each of the basic pathways. We focus on
biodegradation here, which was considered to run in unacclimated aqueous
conditions, where biota (living organisms) are not adapted to the specific
pollutant considered. For biodegradation, three environmental conditions were
considered: aerobic, anaerobic, and removal in waste water treatment plants
(WWTP). In our study we focus on aqueous biodegradation HLTs in aerobic
conditions.

The target variable for machine learning systems that perform regression was 
the natural logarithm of the arithmetic mean of the low and high estimate of 
the HLT for aqueous biodegradation in aerobic conditions, measured in hours. 

A discretized version of the arithmetic mean was also considered in order
to enable us to apply classification systems to the problem. Four classes were
defined: chemicals degrade fast (mean estimate HLT is up to 7 days), moderately
fast (one to four weeks), slowly (one to six months), or are resistant (otherwise).
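
The two target variables can be sketched in Python as follows. The function
names are illustrative. The class boundaries are assumptions where the text
leaves room for interpretation: upper bounds are treated as inclusive, and six
months is taken to be 180 days.

```python
import math

HOURS_PER_DAY = 24
FAST_LIMIT = 7 * HOURS_PER_DAY          # up to 7 days
MODERATE_LIMIT = 4 * 7 * HOURS_PER_DAY  # up to four weeks
SLOW_LIMIT = 6 * 30 * HOURS_PER_DAY     # up to six months (30-day months assumed)


def mean_hlt(low_hours, high_hours):
    """Arithmetic mean of the low and high half-life estimates, in hours."""
    return (low_hours + high_hours) / 2.0


def regression_target(low_hours, high_hours):
    """Natural logarithm of the mean half-life time (hours)."""
    return math.log(mean_hlt(low_hours, high_hours))


def degradation_class(low_hours, high_hours):
    """Discretize the mean half-life time into the four classes of the text."""
    hours = mean_hlt(low_hours, high_hours)
    if hours <= FAST_LIMIT:
        return "fast"
    if hours <= MODERATE_LIMIT:
        return "moderate"
    if hours <= SLOW_LIMIT:
        return "slow"
    return "resistant"


# Hypothetical chemical with a half-life between 1 and 7 days.
print(regression_target(24, 168), degradation_class(24, 168))
```

Both targets are derived from the same mean estimate, so the classification
task is a coarsened view of the regression task rather than an independent
problem.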

From this point on, we proceeded as follows. The CAS (Chemical Abstracts
Service) Registry Number of each chemical was used to obtain the SMILES [22]
notation for the chemical. In this fashion, the SMILES notations for 328 of the
342 chemicals were obtained.

The SMILES notation contains information on the two-dimensional structure 
of a chemical. So, an atom-bond representation, similar to the representation 
used in experiments to predict mutagenicity, can be generated from a SMILES 
encoding of a chemical. A DCG-based translator that does this has been written 
by Michael de Groeve and is maintained by Bernhard Pfahringer. We used this 
translator to generate atom-bond relational representations for each of the 328 
chemicals. Note that the atom-bond representation here is less powerful than the 
QUANTA-derived representation, which includes atom charges, atom types and 
a richer selection of bond types. Especially the types carry a lot of information 
on the substructures that the respective atoms/bonds are part of. 

A global feature of each chemical is its molecular weight. This was included 
in the data. Another global feature is logP, the logarithm of the compound’s 
octanol/water partition coefficient, used also in the mutagenicity application. 
This feature is a measure of hydrophobicity, and can be expected to be important 
since we are considering biodegradation in water. 

The basic atom and bond relations were then used to define a number of
background predicates defining substructures / functional groups that are
possibly relevant to the problem of predicting biodegradability. These predicates
are: nitro (-NO2), sulfo (-SO2 or -O-S-O2), methyl (-CH3), methoxy
(-O-CH3), amine, aldehyde, ketone, ether, sulfide, alcohol, phenol,
carboxylic acid, ester, amide, imine, alkyl_halide (R-Halogen where R is not
part of a resonant ring), ar_halide (R-Halogen where R is part of a resonant
ring), epoxy, n2n (-N=N-), c2n (-C=N-), benzene (resonant C6 ring),
hetero_ar_6_ring (resonant 6 ring containing at least 1 non-C atom),
non_ar_6c_ring (non-resonant C6 ring), non_ar_hetero_6_ring (non-resonant 6 ring
containing at least 1 non-C atom), six_ring (any type of 6 ring),
carbon_5_ar_ring (resonant C5 ring), non_ar_5c_ring (non-resonant C5 ring),
non_ar_hetero_5_ring (non-resonant 5 ring containing at least 1 non-C atom), and
five_ring (any type of 5 ring). Each of these predicates has three arguments:
MoleculeID, MemberList (list of atoms that are part of the functional group) and
ConnectedList (list of atoms connected to atoms in MemberList, but not in
MemberList themselves).



3 Experiments 

3.1 Representations 

Molecular weight, logP and the abovementioned predicates form the basic
relational representation (denoted by R1) considered in our experiments. Two
propositional representations were derived from this. The first one (denoted P1)
has an attribute fgCount for each three-argument predicate fg of the background
knowledge, which is the number of distinct functional groups of type fg
in a molecule. Including logP and molecular weight, this representation has 31
attributes.
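
The derivation of the fgCount attributes can be sketched as follows. This is a minimal illustration, not the procedure actually used in the experiments: the fact format, the helper name `p1_counts`, and the choice to treat two groups as the same when their member atom sets coincide are all assumptions made here.

```python
from collections import Counter

# Hypothetical functional-group facts:
# (predicate, molecule_id, member_atoms, connected_atoms)
facts = [
    ("methyl", "m1", ("c1",), ("c2",)),
    ("methyl", "m1", ("c3",), ("c2",)),
    ("alcohol", "m1", ("o1",), ("c2",)),
    ("nitro", "m2", ("n1", "o1", "o2"), ("c1",)),
    # The same nitro group, matched with its atoms in another order.
    ("nitro", "m2", ("o2", "n1", "o1"), ("c1",)),
]


def p1_counts(facts, predicates):
    """Return {molecule: {fgCount attribute: count}} for the given predicates."""
    # Assumption: a group is identified by its set of member atoms.
    groups = {(mol, pred, frozenset(members)) for pred, mol, members, _ in facts}
    counts = Counter((mol, pred) for mol, pred, _ in groups)
    molecules = sorted({mol for mol, _, _ in groups})
    return {
        mol: {pred + "Count": counts[(mol, pred)] for pred in predicates}
        for mol in molecules
    }


table = p1_counts(facts, ["methyl", "alcohol", "nitro"])
```

Each row of `table` would then be extended with logP and molecular weight to give one propositional example.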

The second propositional representation (denoted P2) has been derived by 
counting all substructures of two and three atoms plus all four-atom
substructures of a star-topology (no chains). Substructures that appear in at least three
compounds (59 of them) are taken into account. For each such substructure 
we have a feature counting the number of distinct substructures of that kind 
in a molecule. The second propositional representation also includes logP and 
molecular weight. 

Many of the functional groups have been selected from the PTE (predictive 
toxicology evaluation) domain theory where the task is to predict carcino- 
genicity of chemicals. In this domain, the approach of Dehaspe and Toivonen pj 
to discover (count) most frequent substructures that occur in the dataset and 
use these in conjunction with propositional learners has been among the most 
successful. Our small substructure representation has been derived along these 
lines. 



3.2 Systems 

A variety of classification and regression systems were applied to the
classification, respectively regression version of the biodegradability problem.
Propositional systems were applied to representations P1 and P2. For classification,
these were the decision-tree inducer C4.5 P) and the rule induction program
RIPPER |Hj. For regression, the regression-tree induction program M5’ P, a
re-implementation of M5 (Hj was used. It can construct linear models in the
leaves of the tree.

Relational learning systems applied include ICL pj, which induces classifi- 
cation rules, SRT PI and TILDE The latter are capable of inducing both 
classification and regression trees. ICL is an upgrade of CN2 [E|| to first-order 
logic, TILDE is an upgrade of C4.5, and SRT is an upgrade of CART pj. TILDE 
cannot construct linear models in the leaves of its trees; SRT can. 

Finally, FFOIL pj was also applied to the classification version of the 
problem. It used a representation (denoted R2) based on the atom and bond 
relations, designed to avoid problems with indeterminate literals. New
predicates are introduced for conjunctions of the form atom(M, X, Element1, _, _),
bond(M, X, Y, BondType), atom(M, Y, Element2, _, _). E.g., o2s(M, X, Y)
stands for atom(M, X, o, _, _), bond(M, X, Y, 2), atom(M, Y, s, _, _).
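
The construction of such bond-pair predicates can be sketched as below. This is an illustrative sketch only: the tuple layout of the atom and bond facts is simplified (the two anonymous atom arguments are dropped), the helper name `bond_pair_facts` is hypothetical, and emitting a fact for both directions of each bond is an assumption rather than something stated in the text.

```python
def bond_pair_facts(atoms, bonds):
    """Derive facts such as ('o2s', mol, x, y) from atom and bond facts.

    atoms: iterable of (molecule, atom_id, element)
    bonds: iterable of (molecule, atom_id_1, atom_id_2, bond_type)
    """
    element = {(mol, atom): elem for mol, atom, elem in atoms}
    facts = set()
    for mol, x, y, bond_type in bonds:
        # Assumption: a bond is usable in both directions.
        for a, b in ((x, y), (y, x)):
            name = f"{element[(mol, a)]}{bond_type}{element[(mol, b)]}"
            facts.add((name, mol, a, b))
    return facts


atoms = [("m1", "a1", "o"), ("m1", "a2", "s"), ("m1", "a3", "c")]
bonds = [("m1", "a1", "a2", 2), ("m1", "a2", "a3", 1)]
derived = bond_pair_facts(atoms, bonds)
```

Because each new predicate fixes both elements and the bond type, a learner can test for a specific bonded pair with a single literal.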






S. Dzeroski et al. 



Table 1. Performance of machine learning systems predicting biodegradability. 



System    Representation  Accuracy  Accuracy (+/-1)  Correlation (r)

C4.5      P1              55.2      86.2             -
C4.5      P2              56.9      82.4             -
RIPPER    P1 (-S0)        52.6      89.8             -
RIPPER    P2              57.6      93.9             -
M5’       P1              53.8      94.5             0.666
M5’       P2              59.8      94.7             0.693
FFOIL     R2              53.0      88.7             -
ICL       R1              55.7      92.6             -
SRT-C     P1              51.3      88.2             -
SRT-C     P1+R1           55.0      90.0             -
SRT-R     P1              49.8      93.8             0.580
SRT-R     P1+R1           52.6      93.0             0.632
TILDE-C   R1              51.0      88.6             -
TILDE-C   P1+R1           52.0      89.0             -
TILDE-R   R1              52.6      94.0             0.622
TILDE-R   P1+R1           52.4      93.9             0.623

BIODEG                                               0.607



Regarding parameter settings, default settings were employed for all systems 
wherever possible. Deviations from default parameter settings will be mentioned 
where appropriate in the results section. 



3.3 Evaluation 

Performance on unseen cases was estimated by performing five 10-fold cross- 
validations. The same folds were used by all systems. Performances reported are 
averages over the 5 cross-validations. Some of the induced models were inspected 
by B. Kompare, acting here as a domain expert, who provided some comments 
on their meaning and agreement with existing knowledge in the domain. 
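
The evaluation protocol, five repetitions of 10-fold cross-validation with folds shared by all systems, can be sketched as follows. This is a minimal sketch under stated assumptions: the helper name `repeated_kfold`, the fixed random seed, and the way examples are dealt into folds are illustrative choices and not taken from the experiments.

```python
import random


def repeated_kfold(n_items, k=10, repeats=5, seed=0):
    """Return `repeats` partitions of range(n_items) into k test folds."""
    rng = random.Random(seed)  # fixed seed so every system sees the same folds
    partitions = []
    for _ in range(repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin into k folds.
        partitions.append([indices[i::k] for i in range(k)])
    return partitions


partitions = repeated_kfold(328)
```

For each partition, a system is trained on nine folds and tested on the tenth, and the reported figure is the average over all five partitions.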

For the regression systems, correlation between the actual and predicted va- 
lues of the log mean half-time of aerobic aqueous biodegradation is reported. We 
also measure classification accuracy (as described below) achieved by discretizing 
the real-valued predictions. 

For the classification systems, classification accuracy is reported. We are 
dealing with ordered class values and misclassification of, e.g., fast as slow is 
a bigger mistake than misclassification of fast as moderate. We thus also record 
accuracy where only misclassification by more than one class up or down counts 
as an error (e.g., fast as slow, or resistant as moderate). This is denoted as 
Accuracy (+/-1) in Table 1.
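
The two accuracy measures can be made concrete with a short sketch. The class ordering fast < moderate < slow < resistant follows the examples given above; the function name `accuracies` and the toy labels are assumptions made for illustration.

```python
CLASSES = ["fast", "moderate", "slow", "resistant"]
RANK = {name: position for position, name in enumerate(CLASSES)}


def accuracies(actual, predicted):
    """Return (accuracy, accuracy within one class up or down)."""
    n = len(actual)
    exact = sum(a == p for a, p in zip(actual, predicted))
    # A prediction counts as an error only if it is off by more than one class.
    near = sum(abs(RANK[a] - RANK[p]) <= 1 for a, p in zip(actual, predicted))
    return exact / n, near / n


actual = ["fast", "fast", "slow", "resistant"]
predicted = ["fast", "slow", "moderate", "moderate"]
acc, acc_pm1 = accuracies(actual, predicted)
```

In this toy case only the first prediction is exactly right, while the third (slow predicted as moderate) is additionally accepted by the relaxed measure.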

4 Results 

Table 1 gives an overview of the performance of the different classification and 
regression systems as applied to the problem of predicting biodegradability. SRT- 
C denotes SRT used to learn classification trees, while SRT-R denotes SRT used 




Experiments in Predicting Biodegradability 






resistant :- logP>=4.91, 'C[H]'<=15 (27/4).
    % Nonpolar (hydrophobic) compounds degrade less readily
resistant :- 'C[Cl]'>=3, mweight<=165.834 (7/1).
    % Halogenated compounds are resistant
fast :- mweight<=110.111, 'O[H]'>=1 (18/4).
    % Alcohols (alkyl -OH) are fast to degrade
fast :- mweight<=108.096, 'C=O'>=1 (15/7).
    % C=O readily degrades
slow :- 'N=O'>=1, mweight<=130.19 (10/0).
    % Compounds with N(-)O degrade slowly
slow :- logP>=1.52, 'C[H]'<=5 (31/16).
slow :- 'CN'>=1, logP>=1.7, mweight>=249.096 (11/3).
    % Very heavy and possibly toxic
slow :- 'C=O'<=0, mweight>=121.182, 'CN'>=1 (23/15).
default moderate (85/51).



Fig. 1. Rules for predicting biodegradability induced by RIPPER. 

to learn regression trees. TILDE-C and TILDE-R have similar meaning. The 
first column lists the system applied, the second the representation used. The 
second column also lists some parameters changed from their default values. The 
representations are described in Section 3.1 (P1, P2, R1) and 3.2 (R2). The next
three columns list performance measures as described in Section 3.3. 

C4.5 was used on the two different propositional representations. Better 
performance was achieved using P2. Default parameters were used. The trees 
generated were too bushy for expert inspection. C4.5 performs worst in terms 
of large misclassification (e.g. fast as slow) errors, i.e. in terms of the measure 
Accuracy (+/-1). 

RIPPER achieves highest accuracy of the classification systems applied. 
With its default parameters RIPPER prunes drastically, producing small rule 
sets. The rule set derived from the entire dataset for representation P2 is given 
in Figure 1, together with some comments provided by our domain expert. 

The expert liked the rule-based representation and the concise rules very 
much (best of the representations shown to him, which included classification 
and regression trees induced by M5’, SRT and TILDE, as well as clausal theories 
induced by ICL). The rules make sense, but are possibly pruned too much and 
cover substantial numbers of negative examples. 

Pruning was then turned down in RIPPER (option -S0), producing larger sets
of longer rules, at a moderate loss of accuracy. The accuracy for representation 
P2 is in this case 54.8 % (again estimated by doing five 10-fold cross-validations). 

M5’ achieves the best results among the systems applied in terms of both
regression accuracy (correlation of almost 0.7) and classification accuracy (almost 60 %, respectively
95 %). M5’ was used with pruning turned down (-f0.0), as this seemed to perform
best in terms of accuracy. Linear models are by default allowed in the leaves of 
the trees. Trees generated with these settings were too large to interpret. 

Trees were generated from the entire dataset with more intensive pruning to 
ensure they were of reasonable size for interpretation by the domain expert. The 
tree generated from representation P2 is shown in Figure 2. The setting -f1.2

logP <= 4.005
|   mweight <= 111.77
|   |   'O[H]' <= 0.5  LM1 (80/49.7%)
|   |   'O[H]' > 0.5   LM2 (22/50.7%)
|   mweight > 111.77
|   |   'C=O' <= 0.5  LM3 (112/65.4%)
|   |   'C=O' > 0.5
|   |   |   'CO' <= 1.5
|   |   |   |   'CN[H]' <= 1.5
|   |   |   |   |   'C[Cl]' <= 1.5  LM4 (7/0%)
|   |   |   |   |   'C[Cl]' > 1.5   LM5 (2/6.68%)
|   |   |   |   'CN[H]' > 1.5  LM6 (9/33.8%)
|   |   |   'CO' > 1.5
|   |   |   |   'C[H]' <= 12.5
|   |   |   |   |   'N[H]' <= 0.5
|   |   |   |   |   |   'CO' <= 2.5  LM7 (5/0%)
|   |   |   |   |   |   'CO' > 2.5   LM8 (10/46.1%)
|   |   |   |   |   'N[H]' > 0.5  LM9 (5/16.3%)
|   |   |   |   'C[H]' > 12.5
|   |   |   |   |   logP <= 2.26  LM10 (5/0%)
|   |   |   |   |   logP > 2.26   LM11 (4/2.42%)
logP > 4.005
|   logP <= 4.895  LM12 (27/53.9%)
|   logP > 4.895
|   |   'C[H]' <= 15.5  LM13 (31/55%)
|   |   'C[H]' > 15.5   LM14 (9/45.9%)

Linear models at the leaves:

Unsmoothed (simple):

LM1:  class = 6.1 + 0.525'C[Cl]' + 0.618'CN' - 1.09'C=O' - 0.559'CN[H]'
LM2:  class = 4.71
LM3:  class = 7.38 - 0.00897mweight + 0.889'C[Br]' + 0.576'C[Cl]' + 0.522'CN' + 0.113'N=O'
LM4:  class = 6.04
LM5:  class = 6.7
LM6:  class = 9.83 - 1.8'N[H]'
LM7:  class = 4.56
LM8:  class = 5.6
LM9:  class = 6.15
LM10: class = 6.04
LM11: class = 6.52 - 0.252'O[H]'
LM12: class = 6.77 + 0.182'C[Cl]' - 0.357'CO'
LM13: class = 9.43 - 1.52'CN'
LM14: class = 12.2 - 0.0157mweight



Fig. 2. Regression tree for predicting biodegradability induced by M5’.



was used for pruning. The numbers in brackets denote the number of examples 
in a leaf and the relative error of the model in that leaf on the training data: 
LM1 was constructed from 80 examples and has 49.7 % relative error on them.

Unsurprisingly, the most important feature turns out to be logP, the
hydrophobicity measure. For compounds to biodegrade fast in water, it helps if they
are less hydrophobic. When a compound is not very hydrophobic (logP <= 4.005),
molecular weight is an important feature. With relatively low molecular weight
(<= 111.77), the presence of an -OH group indicates smaller half-life times. With
no -OH groups (LM1), halogenated compounds degrade more slowly and so do
compounds with CN substructures (positive coefficients in LM1). This is also
consistent with the expert comments on the RIPPER rules.
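
As a worked illustration of how a model tree of this kind produces a prediction, the sketch below evaluates the branch for compounds with logP <= 4.005 and molecular weight <= 111.77, using the coefficients of LM1 and LM2 from Figure 2. The function name is hypothetical, and the feature labels 'O[H]', 'C[Cl]', 'CN', 'C=O' and 'CN[H]' (substructure counts) are an assumed reading of the labels printed in the figure.

```python
def predict_light_hydrophilic(counts):
    """Predicted class value for a compound with logP <= 4.005 and
    mweight <= 111.77; `counts` maps substructure labels to counts."""

    def count(label):
        return counts.get(label, 0)

    if count("O[H]") > 0.5:
        # LM2: at least one -OH group, constant prediction.
        return 4.71
    # LM1: no -OH group; halogen and CN counts raise the predicted value.
    return (6.1
            + 0.525 * count("C[Cl]")
            + 0.618 * count("CN")
            - 1.09 * count("C=O")
            - 0.559 * count("CN[H]"))


plain = predict_light_hydrophilic({})
chlorinated = predict_light_hydrophilic({"C[Cl]": 2})
alcohol = predict_light_hydrophilic({"O[H]": 1, "C[Cl]": 2})
```

Two C-Cl substructures raise the LM1 prediction from 6.1 to about 7.15, whereas an -OH group routes the compound to LM2 and its lower constant value.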

FFOIL uses the R2 representation (Section 3.2). The settings -d10 and -a65
were used; -d10 allows the introduction of "deeper variables" (this does not seem
to have any impact), and -a65 means that a clause must be 65% correct or better 
(FFOIL’s default is 80 %, which seems too demanding in this domain). 







activ(A,B)
carbon_5_ar_ring(A,C,D) ?
+--yes: [9.10211]  % Aromatic compounds are relatively slow to degrade
+--no:  aldehyde(A,E,F) ?
        +--yes: [4.93332]  % Aldehydes are fast
        +--no:  atm(A,G,h,H,I) ?  % If H not present should degrade slowly
                +--yes: mweight(A,J), J =< 80 ?
                |       +--yes: [5.52184]  % Low weight ones degrade faster
                |       +--no:  ester(A,K,L) ?  % Esters degrade fast
                |               +--yes: mweight(A,M), M =< 140 ?
                |               |       +--yes: [4.93332]
                |               |       +--no:  [5.88207]
                |               +--no:  mweight(A,N), N =< 340 ?
                |                       +--yes: carboxylic_acid(A,O,P) ?  % Acids degrade fast
                |                       |       +--yes: [5.52288]
                |                       |       +--no:  ar_halide(A,Q,R) ?  % Halogenated - slow
                |                       |               +--yes: alkyl_halide(A,S,T) ?
                |                       |               |       +--yes: [11.2742]
                |                       |               |       +--no:  [7.81235]
                |                       |               +--no:  phenol(A,U,V) ?
                |                       |                       +--yes: mweight(A,W), W =< 180 ?
                |                       |                       |       +--yes: [4.66378]
                |                       |                       |       +--no:  [7.29547]
                |                       |                       +--no:  [6.86852]
                |                       +--no:  [8.28685]
                +--no:  mweight(A,X), X =< 100 ?
                        +--yes: [6.04025]
                        +--no:  [8.55286]



Fig. 3. A regression tree for predicting biodegradability induced by TILDE. 



FFOIL only uses the atom and bond relations, molecular weight and logP, 
but not the functional group relations/predicates. On the entire dataset, FFOIL 
induces 54 rules. It is interesting that some of these rules use negation. The rule 
activity(A,fast) :- mw(A,C), logp(A,D), not(del(A,_1,_2)), C>104.151,
D>1.52, C<=129.161, D<=3.45, !. states that a compound A degrades fast if
it is not halogenated, is relatively light, and relatively nonhydrophobic. 

ICL was applied to representation R1. In terms of accuracy, it achieves better
results than all other systems not using P2, and in terms of Accuracy (+/-1) 
it performs better than all classification systems except RIPPER on P2. The 
theory induced from the entire dataset contains 87 rules. 

An example rule is: moderate(M) :- atom(M,A1,Elem1,_,_), Elem1 = s,
mweight(M,MW), lt(MW,190), gt(MW,90). It states that a compound with a
sulphur atom and molecular weight between 90 and 190 degrades moderately fast.
The expert comments that sulphur slows down biodegradation. 

Another rule states that a compound is fast to degrade if it contains a benzene 
and a phenol group and is lighter than 170. The expert comments that in this
case degradability is probably due to hydrolysis and photolysis.

SRT upgrades CART to a relational representation, as mentioned above. 
From CART it inherits error-complexity pruning. It can construct linear models 
in the leaves and extends CART methodology by cross-validating these models. 
No linear models in the leaves were allowed in the experiments reported here. 

The SRT results were not obtained by using default settings. Results for 
unmodified error-complexity pruning were not competitive. We thus forced SRT 
to overfit: from the sequence of pruned trees ordered by increasing complexity 
we took the first tree after the most accurate tree that was within one standard
error of the former. The resulting trees were too large for inspection. 

Both a propositional (P1) and a relational representation (P1+R1) were used.
Adding the relational information improves accuracy, the greatest jump being
observed for classification accuracy of SRT-C. Using P2 in addition (P1+P2+R1)
only improves the regression results marginally. SRT-C is better than SRT-R on
accuracy, but worse on Accuracy (+/-1).

TILDE was used for both classification and regression, once using R1 and 
once using P1+R1. TILDE-C was used with default settings. TILDE-R was used
with its ftest parameter set to 0.01, which causes maximal pre-pruning. 

The use of P1 in addition to R1 does not change the performance of TILDE.
Better performance is achieved with regression, not in terms of Accuracy but
in terms of Accuracy (+/-1). Using P2 in addition (P1+P2+R1) yields worse
regression results (r=0.58). 

An example regression tree induced by TILDE-R from the entire dataset 
is given in Figure 3. This tree has actually been generated without using logP 
information. It was analysed and commented upon by the domain expert. The 
fact that it does not use logP actually makes it easier for the influence of the 
functional groups on biodegradability to be identified. Namely, when logP is 
used, a large part of the tree uses logP only. Some of the expert comments are 
given in the tree itself. 

5 Discussion 

Overall, propositional systems applied to representation P2 yield the best
performance. M5’ on this representation yields the highest overall accuracy, Accuracy
(+/-1) and correlation. RIPPER follows with the second best classification
accuracy and an Accuracy (+/-1) matched only by TILDE-R. Of the relational learning
systems, ICL performs best, with the highest classification accuracy and an Accuracy
(+/-1) comparable to that of SRT-R and TILDE-R.

Regression systems perform better than classification ones. This does not 
clearly show when one looks at accuracy alone, but it becomes clearer when one 
looks at Accuracy (+/-1). It thus seems that regression problems can best be
handled by regression systems. 

Using relational information in addition to the propositional formulation P1
does not bring drastic improvements. SRT and TILDE perform slightly better
or the same on P1+R1 as compared to P1. SRT and TILDE used for regression
on P1+R1 still perform (slightly) worse than M5’ on P1. The reason for this
might be the fact that M5’ was using linear regression in the leaves, while SRT
and TILDE were not. 

Note that the propositional representations P1 and P2 contain structural
features derived 1) directly from the functional group relations and 2) from
the atom and bond relations. These features count occurrences of substructures
within compounds. P1 contains definitions of both small and larger groups (such
as rings), while P2 mainly contains small structures (up to 4 atoms). 

The biodegradation rates used in this study were expert estimates rather than
measurements for the most part. We have thus been modeling expert opinions 
on biodegradation rates, and not biodegradation rates themselves. This means 
that we have also modeled expert estimation errors. To the authors’ knowledge, 
only small datasets containing measured biodegradation rates for structurally 
related chemicals are publicly available at present. 

6 Related Work 

Related work includes QSAR applications of machine learning and ILP, on one 
hand, and constructing QSAR models for biodegradability, on the other hand. 
On the ILP side, QSAR applications include drug design (e.g. mutagenicity 
prediction (e.g. ESI), and toxicity prediction IZDI. The latter two are closely 
related to our application. In fact, we have used a similar representation and 
reused parts of the background knowledge developed for them. 

On the biodegradability side, m is closest to our work. The last row of Table 
1, marked BIODEG, gives the correlation between the actual values of the conti- 
nuous class and predictions made by the BIODEG program m The correlation 
is calculated for all 328 chemicals in our database, since the BIODEG program 
has been derived independently. This program estimates the probability of rapid 
aerobic biodegradation in the presence of mixed populations of environmental 
organisms. It uses a model derived by linear regression |TT|. 

The best results of our experiments (correlation of 0.7) are considerably bet- 
ter than the BIODEG program predictions (correlation 0.6). Furthermore, while 
the reported performance results for the machine learning systems are for un- 
seen cases, some of the 200 chemicals used in developing BIODEG also appear in 
our database. In im, CAS numbers for 144 of the 200 chemicals used to derive
BIODEG are provided; of these, 21 also appear in our database. The correlation 
of BIODEG predictions is thus probably even lower than 0.6 for unseen cases. 

Work on applying machine learning to predict biodegradability includes im, 
who compared several AI tools on the same domain and data and found these 
to yield better results than the classical statistical and probabilistic approaches, 
tz8l4i who applied neural nets, and |Sj who applied several different approaches. 

7 Conclusions and Further Work 

Predicting biodegradability is a QSAR problem, similar to predicting mutageni- 
city or toxicity. Based on a handbook of biodegradation rates, we have developed 
a relational dataset including a structural representation of compounds and back- 
ground knowledge on potentially relevant substructures. This dataset is suitable 
for both propositional and relational learning. Particular attention was paid to 
data quality issues: many datasets of this kind have surprisingly many errors, 
such as incorrect SMILES codes, which essentially result in incorrect descripti- 
ons of the compounds and affect the resulting QSAR theories accordingly. The 
dataset itself is thus a contribution on its own. 

We have applied a range of machine learning systems, including ILP systems,
to several representations derived from the relational description of the compo- 
unds. Best performance was achieved on good propositionalisations derived by 
counting substructures. This is in agreement with, e.g., the predictive toxicology 
evaluation results (cf. this volume) where best results were achieved by pro- 
positional systems using relational features representing the presence/count of 
frequent substructures. 

M5’, which achieves the best results, outperforms an approach derived by 
biodegradability experts, implemented in the program BIODEG. The theories 
induced by the machine learning systems were easy to interpret (size permitting) 
and made sense to the domain expert. Given that the biodegradation rates that 
we used as values of the target variable are mostly estimates and not measured 
values, overall performance is satisfactory. 

There is a variety of directions for further work. One possibility is to study 
overall degradation and biodegradation comparatively. Identifying chemicals for 
which degradation and biodegradation time differ is an important topic.
Characterising such chemicals would be an interesting learning problem.

Another important issue is how performance is evaluated when only estimates 
of the target variable are provided. One could argue that if the learned theory 
predicts a value which is between the low and high estimate provided by an 
expert, its prediction is correct. In a sense, we may have applied a too strict 
evaluation criterion here, trying to fit the log mean half-life time, while providing 
a value in the provided interval may have been sufficient. 
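
The interval-based criterion suggested here could be sketched as below. This is only an illustration of the idea, not an evaluation that was carried out: the function name `interval_accuracy` and the example numbers are hypothetical, and predictions and estimates are assumed to be on the same (log) scale.

```python
def interval_accuracy(predictions, intervals):
    """Fraction of predictions lying between the low and high expert estimate."""
    hits = sum(low <= p <= high for p, (low, high) in zip(predictions, intervals))
    return hits / len(predictions)


# Hypothetical predictions with (low, high) expert estimates.
predictions = [5.0, 7.2, 9.0]
intervals = [(4.5, 5.5), (6.0, 7.0), (8.0, 10.0)]
score = interval_accuracy(predictions, intervals)
```

Under this criterion a prediction is penalised only when it leaves the expert's interval, which is more lenient than fitting the log mean half-life time.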

Predicting the logarithm of the mean of the low and high estimates of the 
degradation rate is close to predicting the logarithm of the high estimate. Pre- 
dicting the (logarithm of the) low estimate and combining the two predictions 
might yield better results. This should also be investigated in further work. 


Abstract. In this paper we present 1BC, a first-order Bayesian
Classifier. Our approach is to view individuals as structured terms, and to
distinguish between structural predicates referring to subterms (e.g. atoms
from molecules), and properties applying to one or several of these sub- 
terms (e.g. a bond between two atoms). We describe an individual in 
terms of elementary features consisting of zero or more structural predi- 
cates and one property; these features are considered conditionally
independent following the usual naive Bayes assumption. 1BC has been
implemented in the context of the first-order descriptive learner Tertius,
and we describe several experiments demonstrating the viability of our 
approach. 



1 Introduction 

In this paper we present 1BC, a first-order Bayesian Classifier. While the
propositional Bayesian Classifier makes the naive Bayes assumption of statistical in-
dependence of elementary features (one attribute taking on a particular value) 
given the class value, it is not immediate which elementary features to use in the 
first-order case, where features may be constructed from arbitrary numbers of 
literals. A classification task consists in classifying new individuals given some 
examples. It requires therefore a clear notion of individuals. Our approach is 
to view individuals as structured terms, and to distinguish between structural 
predicates referring to subterms (e.g. atoms from molecules), and properties ap- 
plying to one or several of these subterms (e.g. a bond between two atoms). An 
elementary first-order feature then consists of zero or more structural predicates 
and one property. 

In section 2, we briefly recall the main ideas behind the propositional naive 
Bayesian classifier, and discuss possible ways to upgrade it to a first-order
language. In section 3, we describe the first-order language used in 1BC. Section
4 gives some implementation details, and section 5 describes our experiments 
on 3 well-known datasets from Inductive Logic Programming (ILP). Section 6 
discusses the main implications of our approach. 

2 The Naive Bayesian Classifier 

Like any learner, the naive Bayesian classifier manipulates descriptions of
individuals. The classical naive Bayesian classifier uses an attribute-value language,
a representation formalism that is commonly used in machine learning. Logi-
cally speaking an attribute-value language can be mapped to unary predicate 
logic, where hypotheses use a single universally quantified variable to express 
generalisation over all individuals, and examples are variable-free conjunctions 
concerning single individuals. Whereas this representation is, strictly speaking, 
not propositional, a learning system keeping track of the distinction between ex- 
amples and hypotheses can actually drop the syntactic distinction and express 
both in a variable-free, essentially propositional formalism (the single represen- 
tation trick). In Section 2.1 we recall the propositional naive Bayesian classifier. 
In Section 2.2 we discuss the general problem of upgrading it to deal with non- 
propositional representations. 



2.1 The Propositional Case 

Let $A_i$, $1 \le i \le n$, be a set of attributes, and let $Cl$ be the class attribute.
Given that an individual takes on the values $a_1 \ldots a_n$ for attributes $A_1 \ldots A_n$, in
a Bayesian approach the most likely class value $c$ is the one that maximises

\[ P(c \mid a_1 \ldots a_n) = \frac{P(a_1 \ldots a_n \mid c)\,P(c)}{P(a_1 \ldots a_n)} \tag{1} \]

Here we write $P(a_i)$ as an abbreviation for $P(A_i = a_i)$.

In order to decrease the number of probabilities involved in this calculation,
and to increase the reliability of their estimates, usually the simplifying
naive Bayes assumption is made that
$P(a_1 \ldots a_n \mid c) = P(a_1 \mid c) \ldots P(a_n \mid c)$,
i.e. the values taken on by the different attributes are conditionally
independent given the class value. The predicted class value $c$ is the one that maximises
$P(c)\,P(a_1 \mid c) \ldots P(a_n \mid c)$:

\[ \arg\max_c \; P(c) \prod_{i=1}^{n} P(a_i \mid c) \tag{2} \]

(For a given individual the term $P(a_1 \ldots a_n)$ is a constant normalising term that
can be ignored if we’re only interested in determining the most likely class value.)

The classifier which predicts by maximising the above expression is called the 
naive Bayesian classifier, or Bayesian classifier for short. Essentially, it reads the
description of an individual to be classified, and then tries to estimate how likely 
it is to observe such an individual among each of the possible classes. Thus, the 
fundamental problem of a Bayesian classifier (naive or otherwise) is to estimate 
how likely it is to observe an individual satisfying a particular description among 
given sub-populations. In our case these estimates are obtained from the training 
set, under the naive Bayes assumption of conditional independence. Even in cases 
where this assumption is clearly invalid, the Bayesian classifier has been shown 
to give good results [5]. 
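
A minimal sketch of the propositional classifier just described is given below. It estimates all probabilities as plain relative frequencies from the training set, which is an assumption made here for brevity (no smoothing is applied, so unseen values receive probability zero); the function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (attribute_value_tuple, class_value)."""
    class_counts = Counter(c for _, c in examples)
    # value_counts[(i, c)][a] = number of class-c examples with A_i = a
    value_counts = defaultdict(Counter)
    for attrs, c in examples:
        for i, a in enumerate(attrs):
            value_counts[(i, c)][a] += 1
    return class_counts, value_counts


def predict(model, attrs):
    """Return the class c maximising P(c) * prod_i P(a_i | c)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())

    def score(c):
        s = class_counts[c] / total                           # P(c)
        for i, a in enumerate(attrs):
            s *= value_counts[(i, c)][a] / class_counts[c]    # P(a_i | c)
        return s

    return max(class_counts, key=score)


examples = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "yes"),
    (("sunny", "mild"), "yes"),
]
model = train(examples)
```

For the description ("sunny", "hot") the score of class "no" is 2/5 * 1 * 1/2 = 0.2 against 3/5 * 1/3 * 1/3 for "yes", so "no" is predicted.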



2.2 The First-Order Case 

Upgrading the Bayesian classifier to first-order representations requires a per- 
spective on how exactly learning in first-order logic generalises attribute-value 
learning. Recently, a number of such perspectives have been proposed [1, 17, 10].
Each of these approaches makes certain assumptions on what an individual is,
and how it is represented. Our approach is best understood by thinking of indi- 
viduals as structured objects represented by first order terms in a strongly typed 
language [6]. Our actual implementation makes use of a flattened, function-free 
Prolog representation, as explained in Section 3. 

As in the propositional case, we will assume that the domain provides a well- 
defined notion of an individual, e.g. a patient in a medical domain, a molecule in 
mutagenicity prediction, or a board position in chess. To each individual is asso- 
ciated its description (everything that is known about it except its classification) 
and its classification. In a first-order representation the description of an individ- 
ual can be expressed by a single structured term. In the attribute-value case this 
term is a tuple (element of a cartesian product) of attribute values (constants) . 
For instance, in a medical domain each patient could be represented by a five- 
tuple specifying name, age, sex, weight, and blood pressure of the patient. The 
first-order case generalises this by allowing other complex types at the top-level 
(e.g. sets, lists), and by allowing intermediate levels of complex subtypes before 
the atomic enumerated types are reached. 

Example 1. Consider Michalski’s east- and westbound trains learning problem. 
We start with a number of propositional attributes: 

Shape = {rectangle, u_shaped, bucket, hexa, . . .} 

Length = {double, short} 

Roof = {flat, jagged, peaked, arc, open} 

Load = {circle, hexagon, triangle, . . .} 

A car is a 5-tuple describing its shape, length, number of wheels, type of roof, 
and its load: 

Car = Shape x Length x Integer x Roof x Load 
And finally we define a train as a set of cars: 

Train = 2^Car

Here is a term representing a train with 2 cars: 

{(u_shaped, short, 2, open, triangle), (rectangle, short, 2, flat, circle)}. 

In this example an individual is represented by a set of tuples of constants, rather 
than by a tuple of constants as in the propositional case. Notice that the complex 
type set of tuples leads to what has been called the multiple-instance problem
[3]. It has been argued that the multiple instance problem represents most of the
complexity of upgrading to a first-order representation [1]. However, we would 
like to stress that the above representation does not prevent a deeper nesting of 
types. 

If we want to apply a Bayesian classifier to Example 1, we need a general 
mechanism to estimate the probability of observing an arbitrary set of tuples. 
Following a naive Bayesian approach we want to decompose the probability of a
set, for instance into the probability of each of its members occurring separately. 
In our example, we would consider the event of one tuple occurring in the set 
independent of the events of other tuples occurring in it (given the class value) . 
For example, the probability of the above train occurring among eastbound trains
could be assessed by estimating the probability of a u_shaped short open car 
with 2 wheels whose load is a triangle occurring in eastbound trains and the 
probability of a rectangle short car with a flat roof whose load is a circle 
occurring in eastbound trains. Notice however that this is not the only way 
to decompose a probability distribution over sets. For instance, the number of 
elements of the set may be governed by a separate distribution. 

Now, how are the probabilities of particular cars estimated? Again, there 
are several possibilities. One possibility is to decompose again, using the propo- 
sitional naive Bayesian decomposition over tuples. Another possibility is not 
to decompose cars, treating them as atomic terms instead whose probability 
should be estimated directly from the training set. Our approach is flexible, al- 
lowing the user a maximum depth until which the term can be decomposed. 
However, in this paper we will focus on a recursive decomposition until the 
deepest level. That is, the probability of the above train occurring among
eastbound trains is assessed by estimating the probabilities P(u_shaped|eastbound),
P(rectangle|eastbound), P(short|eastbound), P(2|eastbound), P(open|eastbound),
P(flat|eastbound), P(triangle|eastbound), and P(circle|eastbound).

In addition, we consider the negation of properties not satisfied by any car
in the train. Indeed, the probability of the set {b,c} as a subset of {a,b,c}
can be assumed inversely proportional to the probability of a. Therefore the
probabilities of features not occurring in the previous train are also considered,
such as P(¬diamond|eastbound), P(¬long|eastbound), and P(¬3|eastbound), provided
some trains in the training set contain diamond-shaped cars, long cars, or cars
with 3 wheels. The probability of the observed train is then approximated by
the product of the aforementioned probabilities.
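
The product described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `train_likelihood` and all probability values are hypothetical and are not taken from IBC or from any data set.

```python
from math import prod


def train_likelihood(present, absent, p_given_class):
    """Approximate P(train | class) as a product over elementary properties.

    present: properties satisfied by at least one car of the train
    absent:  properties seen in the training set but satisfied by no car
    p_given_class: dict mapping property -> P(property | class)
    """
    positive = prod(p_given_class[f] for f in present)
    negative = prod(1.0 - p_given_class[f] for f in absent)
    return positive * negative


# Hypothetical estimates of P(property | eastbound); not real data.
p_east = {"u_shaped": 0.5, "rectangle": 0.8, "short": 0.9, "2": 0.9,
          "open": 0.6, "flat": 0.4, "triangle": 0.5, "circle": 0.4,
          "diamond": 0.1, "long": 0.7, "3": 0.2}

present = ["u_shaped", "rectangle", "short", "2",
           "open", "flat", "triangle", "circle"]
absent = ["diamond", "long", "3"]
likelihood = train_likelihood(present, absent, p_east)
```

Multiplying the result by the class prior and comparing across classes would give the usual naive Bayesian decision; that step is omitted here.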

To summarise the discussion so far, the fundamental problem of a first-order 
Bayesian classifier is how to decompose a probability distribution over a com- 
plex type possibly involving several levels of nesting. The main difference with 
the propositional case is that there are a number of ways of approaching this 
problem, none of which seem a priori preferable. In the next section we describe 
our approach, which is a transformation approach in the sense that we use a 
flattened, function-free representation instead of the above term-based represen- 
tation. The reason for this representation change is that IBC is implemented on 
top of an existing system, implemented in C, which uses a flattened representa- 
tion. Using a flattened representation means that we have to ‘emulate’ complex 
types such as tuples, lists, and sets. In fact, we will only be concerned with 
tuples, emulated by so-called structural functions, and sets, emulated by non- 
determinate structural predicates, as explained in the next section. Handling 
other non-determinate types such as lists is left for future work. 

It is important not to confuse the flattening transformation with proposition- 
alisation as e.g. done by LINUS [9]. Since individual-based representations can 
always - conceptually - be propositionalised, the distinguishing characteristic of 
propositionalisation approaches is that they transform the first-order data into 
propositional form. In contrast, IBC operates directly on the first-order data.

3 A First-Order Language for Bayesian Classifiers

In this section we extend the representation formalism employed by the Bayesian 
classifier. We employ Prolog notation, with variables starting with capitals and 
constants starting with lowercase letters, and commas indicating conjunction 
between literals. 

3.1 Structural Predicates and Properties 

As explained above, a domain is thought of as a hierarchy of complex types. 
Instead of specifying this type hierarchy directly, the teacher associates with 
each complex type a structural predicate that can be used to refer to some of 
its subterms. 

Definition 1. A structural predicate is a binary predicate associated with a
complex type representing the relation between that type and one of its parts.

For instance, for an n-dimensional cartesian product, each of the n projection
functions can be represented by a structural predicate (since it is a 1-to-1
relation, it is usually omitted). For a list type and a set type, list membership and
set membership are structural predicates (notice that these are not functions,
but 1-to-n relations). For example, the following conjunction of literals refers to
the load L of some car C of a train T: train2car(T,C), car2load(C,L).

Definition 2. A functional structural predicate, or structural function, refers to
a unique subterm, while a non-determinate structural predicate is non-functional.

In the above example car2load is functional and train2car is non-determinate.
Our non-determinate structural predicates are similar to the structural literals 
of [17], however they are not required to be transitive. 

Definition 3. A property is a predicate characterising a subset of a type. 

A property is for instance the length of a car, short(C), or the shape of a load,
load(L,triangle).

Definition 4. A parameter is an argument of a property which is always
instantiated. If a property has no parameter (or only one instantiation of its
parameters), it is boolean, otherwise it is multivalued.

Our parameters correspond to valued arguments of [15]. We also assume that the 
value of parameters depends functionally on the instantiation of the remaining 
(relational [15]) arguments. 

Definition 5. A property is propositional if it has only one argument which is 
not a parameter, and relational if it has more than one. 

The property shape(C,rectangle) is propositional, while bond(Atom1,Atom2,1)
is relational.







3.2 Features 

Definition 6. An individual variable is a variable of the complex type describing
the domain of interest. A first-order feature of an individual is a conjunction of
structural predicates and of properties where:

- each structural predicate uses one of the variables already introduced, and
  introduces a new variable,

- properties only use the variables introduced by structural predicates or the
  individual variable,

- all variables are used either by a structural predicate or a property.

For instance, the following is a first-order feature: train2car(T,C), car2load(C,L),
load(L,triangle). Note that all variables are existentially quantified except the
individual variable, which is free.¹ This feature is true of any train which has a
car which has a triangular load. The condition train2car(T,C) is not a first-order
feature, and neither is train2car(T,C1), train2car(T,C2), short(C1)
nor load(L,triangle).

Definition 7. A feature is functional if all structural predicates are functional,
otherwise it is non-determinate.

Definition 8. A functional feature is boolean if it contains a single property and
this property is boolean, otherwise it is multivalued. A non-determinate feature
is always boolean.

For instance, the feature train2firstcar(T,C), shape(C,Shape), where Shape
indicates a parameter, is multivalued. The first car of a given train has only one
shape. On the other hand, the feature train2car(T,C), shape(C,Shape) is
boolean rather than multivalued. The same train can have a car with a
rectangular shape, and another car with a non-rectangular shape.

The distinction between structural predicates and properties is best under- 
stood in the context of a representation of individuals by terms: structural pred- 
icates refer to subterms (and introduce new variables) , properties treat subterms 
as atomic (and consume variables). However, the same distinction can be made 
when using a flattened representation, which is in fact what we use in the IBC 
system. Flattening requires introducing a name for all relevant subterms. For in- 
stance, the train of Example 1 could have the following flattened representation: 



train(t1).
train2car(t1,c1).
shape(c1,u_shaped).
length(c1,short).
wheels(c1,2).
roof(c1,open).
car2load(c1,l1).
load(l1,triangle).

train2car(t1,c2).
shape(c2,rectangle).
length(c2,short).
wheels(c2,2).
roof(c2,flat).
car2load(c2,l2).
load(l2,circle).
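
To illustrate how a first-order feature is evaluated against such a flattened representation, here is a minimal Python sketch. It is not the IBC implementation (which is written in C on top of Tertius); the tuple encoding of facts and the helper names `is_var`, `match` and `satisfiable` are assumptions made for this example. The individual variable T is bound in advance, and all other variables are treated as existentially quantified.

```python
# Flattened facts for the train of Example 1, stored as (predicate, args) tuples.
facts = {
    ("train", ("t1",)),
    ("train2car", ("t1", "c1")), ("train2car", ("t1", "c2")),
    ("shape", ("c1", "u_shaped")), ("shape", ("c2", "rectangle")),
    ("length", ("c1", "short")), ("length", ("c2", "short")),
    ("wheels", ("c1", "2")), ("wheels", ("c2", "2")),
    ("roof", ("c1", "open")), ("roof", ("c2", "flat")),
    ("car2load", ("c1", "l1")), ("car2load", ("c2", "l2")),
    ("load", ("l1", "triangle")), ("load", ("l2", "circle")),
}


def is_var(term):
    # Prolog convention: variables start with a capital letter.
    return term[:1].isupper()


def match(args, fact_args, binding):
    """Extend `binding` so that `args` equals `fact_args`, or return None."""
    extended = dict(binding)
    for arg, value in zip(args, fact_args):
        if is_var(arg):
            if extended.setdefault(arg, value) != value:
                return None
        elif arg != value:
            return None
    return extended


def satisfiable(literals, binding, facts):
    """True if some extension of `binding` makes every literal a known fact."""
    if not literals:
        return True
    (pred, args), rest = literals[0], literals[1:]
    for fact_pred, fact_args in facts:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        extended = match(args, fact_args, binding)
        if extended is not None and satisfiable(rest, extended, facts):
            return True
    return False


# train2car(T,C), car2load(C,L), load(L,triangle), evaluated for the individual t1.
feature = [("train2car", ("T", "C")), ("car2load", ("C", "L")),
           ("load", ("L", "triangle"))]
has_triangle_load = satisfiable(feature, {"T": "t1"}, facts)
```

Because only the listed facts are considered true, the sketch behaves as under the Closed-World Assumption.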



¹ In a proper first-order language this feature would be written as
∃C,L: train2car(T,C) ∧ car2load(C,L) ∧ load(L,triangle).

Which predicates are structural and which are properties is part of the
definition of the hypothesis language, as it cannot always be detected automatically,
especially when the representation is flattened and no types are defined. Let’s
consider an example of the declarations required by IBC.

--INDIVIDUAL
train 1 train cwa
--STRUCTURAL
train2car 2 1:train *:car * cwa
car2load 2 1:car 1:load * cwa
--PROPERTIES
eastbound 1 train * cwa
shape 2 car #shape * cwa
short 1 car * cwa
roof 2 car #kind * cwa
wheels 2 car #nb_wheels * cwa
load 2 load #l_shape * cwa

The first two lines define a single individual variable of type train. The next
three lines define the structural predicates. train2car is a many-to-one relation,
indicating that one train may contain many cars, but each car belongs to
exactly one train. This functions as a language bias, since rules like eastbound(T1)
:- train2car(T1,C), train2car(T2,C), eastbound(T2) will not be considered.
Similarly, car2load(C,L) indicates that each car contains exactly one
load. Finally, the following lines indicate the properties. The type of the second
argument of the second property, #shape, is preceded by #, indicating that it is a
parameter which must always be instantiated in rules. The label cwa means that
the Closed-World Assumption is used for these predicates.

These declarations are similar to mode declarations, albeit on the predicate 
level rather than the argument level. ILP systems such as Progol [11] and Warmr 
[2] use mode declarations such as train2car(+Train,-Car) indicating that the 
first argument is an input argument and should use a variable already occurring 
previously in the current hypothesis. They however do not distinguish between 
structural predicates and properties. 



3.3 Elementary Features 

From the naive Bayesian perspective, non-elementary features are those whose 
joint probability distribution is approximated from the distributions of the el- 
ementary features. Distinguishing between elementary and non-elementary fea- 
tures is crucial to the naive Bayesian approach. 

To understand the distinction between elementary and non-elementary first- 
order features, consider the following features: 

train2car(T,C), length(C,short)

train2car(T,C), roof(C,open)

train2car(T,C), length(C,short), roof(C,open)

The first feature is true of trains having a short car. The second feature is true
of trains having an open car. The third feature is true of trains having a short, 
open car. From the naive Bayes perspective this third feature is non-elementary, 
as we assume (justifiably or otherwise) that the probability of a car being short 
is independent of the probability of a car being open, given the class value. 
Therefore our first-order Bayesian classifier needs to have access to the first and 
second feature, but not the third. 

Definition 9. A feature is elementary if it contains a single property.

Notice that properties may express relations between subterms, e.g. mol2atom(M,A1),
mol2atom(M,A2), bond(A1,A2) would be an elementary first-order feature which
describes the case of a molecule containing two atoms with a bond between them.
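
Definition 9 amounts to counting the properties in a feature. The following Python sketch makes this concrete; the sets `STRUCTURAL` and `PROPERTIES` stand in for the user's declarations and, like the function name `is_elementary`, are assumptions made for the example.

```python
# Hypothetical declarations: which predicates are structural, which are properties.
STRUCTURAL = {"train2car", "car2load", "mol2atom"}
PROPERTIES = {"length", "roof", "shape", "wheels", "load", "bond"}


def is_elementary(feature):
    """A feature (list of (predicate, args) literals) is elementary
    if it contains exactly one property."""
    properties = [pred for pred, _ in feature if pred in PROPERTIES]
    return len(properties) == 1


short_car = [("train2car", ("T", "C")), ("length", ("C", "short"))]
short_open_car = [("train2car", ("T", "C")), ("length", ("C", "short")),
                  ("roof", ("C", "open"))]
bonded_atoms = [("mol2atom", ("M", "A1")), ("mol2atom", ("M", "A2")),
                ("bond", ("A1", "A2"))]
```

Here `short_car` and `bonded_atoms` are elementary, while `short_open_car` is not, since it contains two properties.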

4 Implementation 

IBC has been implemented in C in the context of the first-order descriptive 
learner Tertius [7]. Let us first describe briefly Tertius’ abilities which are 
used in IBC. 

Tertius is able to deal with extensional explicit knowledge (i.e. the truth
value of all ground facts is given), with extensional knowledge under the Closed
World Assumption (i.e. all true ground facts are given), or with intensional
knowledge (i.e. truth values are derived using either Prolog's inference mechanism
or a theorem prover). Tertius can also deal with (weakly) typed predicates, that
is, each argument of a predicate belongs to a named type and the set of constants
belonging to one type defines its domain. Moreover, if a domain is continuous,
Tertius allows one to discretise it into several intervals of one standard deviation
each, centered on the mean.

Given some knowledge concerning the domain, Tertius returns a list of
interesting sets of literals. It performs a top-down search, starting with the empty
set and iteratively refining it. To avoid considering the same clauses (and their
refinements!) several times, the refinement steps (i.e. adding a literal,
unifying two variables, and instantiating a variable) are ordered. Once a
particular refinement step is applied, none of its predecessors is applicable anymore.

The search space can be seen as a generalisation of set-enumeration trees [14] to 
first-order logic. 
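
The effect of ordering the refinement steps can be seen in the propositional case, where it reduces to a set-enumeration tree. The Python sketch below illustrates only that propositional idea, not Tertius' first-order search; the function name `enumerate_sets` and the size bound are assumptions made for the example.

```python
def enumerate_sets(items, max_size):
    """Yield every subset of `items` with at most `max_size` elements exactly once."""
    def expand(current, start):
        yield tuple(current)
        if len(current) == max_size:
            return
        for i in range(start, len(items)):
            # Only items after the last one added may be used next,
            # so no set is ever reached along two different paths.
            yield from expand(current + [items[i]], i + 1)

    yield from expand([], 0)


subsets = list(enumerate_sets(["a", "b", "c"], 2))
```

The bound on the set size plays the same role as the maximum number of literals that restricts the first-order search.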

Since there might be an infinite number of refinements, the search is restricted 
to a maximum number of literals and of variables. Other language biases are 
the declaration of structural predicates and properties, the distinction between 
functional and non-determinate structural predicates, and the use of parameters, 
as explained in the previous section. 

Elementary first-order features are generated by constraining Tertius to
generate only hypotheses containing exactly one property and no unnecessary
structural predicates. The features can optionally be read from a file. In
Equation 2, the features A_i = a_i are replaced by elementary first-order features f.

Each conditional probability P(f|c) of the feature value f given the value c of
the class is then estimated from the training data. Writing n(f∧c) for the
number of individuals satisfying f and Cl = c, n(c) for the number of individuals
satisfying Cl = c, and F for the number of values of the feature (the number of
possible values for a multivalued feature, 2 for a boolean or a non-determinate
feature), the Laplace estimate P(f|c) = (n(f∧c) + 1) / (n(c) + F) is used in order
to avoid null probabilities in the product in Equation 2.
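
As a small worked illustration, the Laplace estimate in its usual form, (n(f∧c) + 1) / (n(c) + F), can be written as the following Python function. The function and argument names are ours, and the counts in the example are invented.

```python
def laplace_estimate(n_f_and_c, n_c, num_values):
    """Laplace-corrected estimate of P(f | c).

    n_f_and_c:  number of individuals of class c that satisfy feature f
    n_c:        number of individuals of class c
    num_values: number of values F of the feature (2 for a boolean feature)
    """
    return (n_f_and_c + 1) / (n_c + num_values)


# A boolean feature never observed among 10 individuals of the class
# still receives a non-zero probability.
unseen = laplace_estimate(0, 10, 2)
```

With these invented counts, an unseen boolean feature is assigned probability 1/12 rather than 0, so it cannot zero out the whole product.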



5 Experiments 

In this section we describe experimental results on mutagenesis, finite-element 
mesh design, and KRK-illegal. 



5.1 Mutagenesis 

This problem concerns identifying mutagenic compounds. We considered the
“regression friendly” dataset and we used only the atom and bond structure of the
molecule, represented by a structural predicate atm(M,A) linking an atom A to its
molecule M and four properties atomel(A,El), atomty(A,AType), atomch(A,Charge)
and bond(A1,A2,BondType).

IBC gets an accuracy of 80.9% on the training set, which is comparable to 
Progol’s accuracy of 79.8% but lower than regression’s accuracy of 89.9% [16]. 

IBC was also evaluated using a 3-fold cross-validation to be compared to the 
accuracy estimated on a randomly chosen test set of one third in [16]. The ac- 
curacy resulting from the 3-fold cross-validation is 74.5%. Progol and regression 
gave respectively 71.4% and 85.7%. 



5.2 Finite Element Mesh Design 

This domain is about finite element methods in engineering. The task is to
predict how many elements should be used to model each edge of a structure
[4]. The target predicate is mesh(Edge,Number) where the Number of elements
in the Mesh model can vary between 1 and 17. Each edge is described by
three multivalued properties type(Edge,Type), support(Edge,Support) and
load(Edge,Load). Three structural predicates neighbour_xy_r(E1,E2),
neighbour_yz_r(E1,E2) and neighbour_zx_r(E1,E2) provide the functional
representation of neighbour(E1,E2) which is necessary to define multivalued
features. Structural predicates opposite_r(E1,E2) and equal_r(E1,E2) are also
a functional representation of other topological relations.

The accuracy achieved by IBC is 61.9% considering properties of the edge 
only, and 66.2% when considering features of the surrounding edges. This is 
lower than Golem’s accuracy of 84.9% [4] but it shows an improvement of the 
accuracy when topological properties are used. This is confirmed on the next 
domain.

5.3 Illegal Chess Endgame Positions



The final experiment concerns the chess endgame domain White King and Rook
vs. Black King (KRK) [12]. The classification task is to distinguish between
illegal and legal board positions. The IBC representation employs a structural
function board2whiteking to refer to the position of the White King (similarly
for the other two pieces), and two structural functions pos2rank and pos2file
to translate a position into rank and file. We have two propositional properties
rankeq and fileeq equating rank/file with a number, and three relational
properties adj, eq and lt to compare ranks/files. The propositional elementary
features in this domain are exemplified by board2whiteking(A,B), rankeq(B,1)
and board2blackking(A,B), fileeq(B,8). The relational elementary features
are of the form board2whiteking(A,B), board2whiterook(A,C), pos2rank(B,D),
pos2rank(C,E), eq(E,D).

Following the experiments reported in [9], we used 5 training sets of 100 board
positions each, and a test set of 5000 positions. Table 1 gives the accuracy over
the training set, and the accuracy over the test set, averaged over the 5 training
sets. IBC 2/2 refers to features with no more than 2 literals and 2 variables, i.e. the
propositional features referred to above (in this case pos2file and pos2rank
are used as properties with a parameter as second argument). Similarly, IBC 5/5
refers to features with up to 5 literals and 5 variables, which includes both the
propositional and the relational features. IBC FO refers to relational features
only (the two propositional properties rankeq and fileeq were removed from
the representation).



Table 1. Results in the KRK-illegal domain.

System           Training accuracy     Test accuracy
Majority class   64.0% (sd. 3.0%)      66.3%
MLC++            79.0% (sd. 3.1%)      57.0% (sd. 2.6%)
IBC 2/2          79.0% (sd. 3.5%)      56.2% (sd. 1.4%)
IBC 5/5          91.2% (sd. 2.5%)      84.3% (sd. 5.2%)
IBC FO           93.8% (sd. 3.6%)      88.3% (sd. 2.8%)



The results show that KRK-illegal is a difficult domain for a Bayesian
classifier, since the best result reported in [9] was 98.1% on the test set, achieved by
LINUS. Nevertheless, the experiment clearly demonstrates that the use of
first-order features considerably improves the performance of the Bayesian classifier.
With only propositional features, the result on the test set drops well below the
majority class. This means that propositional features actually have negative
information content [8]. To verify that this was not due to a bug in IBC, we also
ran the Bayesian classifier in MLC++ on the same data, and got virtually the
same results for the propositional features only.

6 Discussion



In this paper we presented a first-order Bayesian classifier. While many
propositional learners have been upgraded to first-order logic in ILP, the case of the
Bayesian classifier poses a problem that has not been satisfactorily solved before,
namely how to distinguish between elementary and non-elementary first-order
features. Treating each Prolog literal as a feature, as is done in LINUS [9], is not
a solution, because many literals do not contain a reference to an individual,
and thus the relative frequency associated with that literal cannot be attributed
to an individual. Our approach gives a clear picture of how an individual-based
first-order representation upgrades attribute-value learning, namely by allowing
relational and non-determinate features. In this respect, the work extends
previous work on the relationship between propositional and first-order learning
[1, 6, 10, 15, 17].

One can argue that any individual-based first-order learning problem, as de- 
fined in this paper, can be transformed to attribute-value learning, by introduc- 
ing an attribute for any first-order feature in the hypothesis language. However, 
one should distinguish between transformation of the hypothesis language and 
transformation of the data (propositionalisation) . IBC does the first but not the 
second. Furthermore, the transformation of the hypothesis language is mostly 
conceptual: it is perfectly possible to explain how IBC operates on a multiple- 
instance problem by decomposing a probability distribution on sets. Thus, IBC 
clearly extends the propositional naive Bayesian engine. That being said, notice 
that IBC is able to consider the same features and is guaranteed to perform at 
least as well as a purely propositional Bayesian classifier, as was demonstrated 
in the KRK-illegal domain. This is a desirable property that is currently shared 
by only a few ILP systems. 

Pompe and Kononenko also describe an application of naive Bayesian classi- 
fiers in a first-order context [13]. However, in their approach the naive Bayesian 
formula is used in a post-processing step to combine the predictions of several, 
independently learned first-order rules. As far as we are aware, the present paper 
is the first to describe a first-order naive Bayesian learner. 

Future work includes refining the declarative bias specification of Tertius 
and IBC. We would also like to handle other non-determinate types such as lists. 
Finally, we would like to extend IBC to an interactive truth-value predictor,
which could answer ground Prolog queries from an extensional database.

Abstract. Since its inception, the field of inductive logic programming
has been centrally concerned with the use of background knowledge 
in induction. Yet, surprisingly, no serious attempts have been made to 
account for background knowledge in refinement operators for clauses, 
even though such operators are one of the most important, prominent 
and widely-used devices in the field. This paper shows how a sort the- 
ory, which encodes taxonomic knowledge, can be built into a downward, 
subsumption-based refinement operator for clauses. 



1 Introduction 

Since its inception, the field of inductive logic programming (ILP) has been 
centrally concerned with the use of background knowledge in induction. Yet, 
surprisingly, no serious attempts have been made to account for background 
knowledge in refinement operators for clauses, even though such operators are 
one of the most important, prominent and widely-used devices in the field. 

As a first step to developing methods for incorporating background knowl- 
edge into refinement, this paper develops a downward refinement operator for 
clauses into which is built a simple form of background knowledge — a sort theory, 
which encodes taxonomic knowledge. The approach to achieving this is straight- 
forward. The clauses in the space of refinements are sorted in that each variable 
is associated with a sort and the variable ranges over only elements in that sort. 
The background knowledge that is built into refinement is a theory stating how 
sorts are related and what objects belong to what sorts. The background knowl- 
edge is incorporated into subsumption, which underlies the downward refinement 
operator, in one simple way: by restricting attention to those substitutions that 
are well-sorted in that the term substituted for a variable denotes an object that 
belongs to the sort associated with the variable. The development of a sorted 
downward refinement operator based on this new subsumption relation follows 
the same lines used by Nienhuys-Cheng and De Wolf [6] to develop an unsorted 
refinement operator based on unsorted subsumption.

The idea of building a theory into a system through its mechanisms for
instantiation has been well explored and exploited in automated deduction. And 
a very general definition of instantiation with built-in theories has been presented 
by Frisch and Page [4]. I conjecture that it is possible for ILP systems to also 
benefit greatly from using instantiation with built-in theories. Frisch and Page 
[3, 8, 7] have already explored this idea in the context of generalisation. This 
paper takes a first step towards exploring this idea in the context of refinement. 



2 Sorted Logic 



The logical language used in this paper is a very simple sorted logic. It is almost 
identical to ordinary first-order predicate calculus; syntactically it provides some 
simple extensions and semantically it uses the same kind of models. 

In addition to the usual function and predicate symbols, the lexicon of the 
sorted language contains a disjoint set of sort symbols. Typographically, sort 
symbols are written entirely in small capitals as such: MAMMAL. Semantically, a
sort symbol, like a monadic predicate symbol, denotes a subset of the domain, 
called a sort. 

The sorted clauses with which refinement works are constructed in the same
way as ordinary clauses except that sorted variables are used instead of ordinary
variables. A sorted variable is a pair, $x{:}\tau$, where $x$ is a variable name and $\tau$
is a sort symbol. For example, $x{:}\mathrm{DOG}$ is a sorted variable. To avoid confusion
I never write a formula containing two distinct variables that have the same
variable name. That is, no formula contains both $x{:}\tau$ and $x{:}\tau'$ where $\tau$ and $\tau'$
are distinct. $\tau$ and $\omega$ are used as meta-linguistic symbols that always stand for
sort symbols.

Semantically, a sorted variable ranges over only the subset of the domain
denoted by its sort symbol. The simplest way to define the semantics of
universally-quantified sorted sentences is in terms of their equivalence to ordinary
sentences: $\forall x{:}\tau\ \phi$ is logically equivalent to $\forall x\ \neg\tau(x) \lor \phi'$, where $\phi'$ is the result
of substituting $x$ for all free occurrences of $x{:}\tau$ in $\phi$. The formula that
results from removing all sorted variables from a formula $\phi$ by rewriting with
this equivalence is called the normalisation of $\phi$ and is denoted by $\phi^N$. So, for
example, the normalisation of $\forall x{:}\mathrm{MAN}\ \forall y{:}\mathrm{BLONDE}\ \mathit{Loves}(x,y) \lor \mathit{Loves}(y,x)$ is
$\forall x, y\ \neg\mathit{Man}(x) \lor \neg\mathit{Blonde}(y) \lor \mathit{Loves}(x,y) \lor \mathit{Loves}(y,x)$, and the two sentences
are, by definition, logically equivalent. Notice that the normalisation of a sorted
clause is itself a clause; such a clause is referred to as a normalised clause.
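
Normalisation of a clause can be sketched mechanically. The Python fragment below is only an illustration under simplifying assumptions: literals are encoded as (positive, predicate, arguments) triples, a sorted variable is a (name, sort) pair, and arguments are variables or constants only (no nested function terms). The encoding and the name `normalise` are ours.

```python
def normalise(clause):
    """Return the normalised clause: one negative sort literal per sorted
    variable, followed by the original literals with the sorts stripped."""
    sort_literals = []
    body = []
    for positive, pred, args in clause:
        plain_args = []
        for arg in args:
            if isinstance(arg, tuple):          # sorted variable x:tau
                name, sort = arg
                guard = (False, sort, (name,))  # the literal "not tau(x)"
                if guard not in sort_literals:
                    sort_literals.append(guard)
                plain_args.append(name)
            else:                               # constant
                plain_args.append(arg)
        body.append((positive, pred, tuple(plain_args)))
    return sort_literals + body


# The example from the text: Loves(x,y) or Loves(y,x) with x:MAN and y:BLONDE.
x, y = ("x", "Man"), ("y", "Blonde")
clause = [(True, "Loves", (x, y)), (True, "Loves", (y, x))]
normalised = normalise(clause)
```

The result contains the two negative sort literals for Man(x) and Blonde(y) followed by the two Loves literals, matching the normalised sentence given above.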

The background knowledge that is to be built into the instantiation,
subsumption and refinement of sorted clauses is known as a sort theory. A sort
theory is a finite set of sentences that express relationships among the sorts
and the sortal behaviour of the functions. Sentences of the sort theory are
constructed like ordinary sentences of first-order predicate calculus except that they
contain no ordinary predicate symbols; in their place are sort symbols acting as
monadic predicate symbols. Hence, every atomic formula of the sort theory is of
the form $\tau(t)$, where $\tau$ is a sort symbol and $t$ is an ordinary term. Formulas of
the sort theory are assigned truth values in the usual Tarskian manner.

To keep matters simple, we restrict the form of the sort theory. In
particular, each sentence of the sort theory will have one of only two forms: function
sentences and subsort sentences. Each function sentence is of the form

$\forall\ \tau_1(x_1) \land \cdots \land \tau_n(x_n) \rightarrow \tau(f(x_1, \ldots, x_n))$,

where each $x_i$ is a distinct variable and $f$ is an n-ary function symbol. If n = 0,
then $f$ is a constant symbol and the form of the function sentence is simply
$\tau(f)$. Each subsort sentence is of the form $\forall x\ \tau_1(x) \rightarrow \tau_2(x)$. We can think of the
subsort sentences of the sort theory as forming a graph where each sort symbol
that occurs in some subsort sentence is a node and there is an arc directed from
$\tau_2$ to $\tau_1$ if and only if the sort theory contains the formula $\forall x\ \tau_1(x) \rightarrow \tau_2(x)$. We
require that the sort theory is such that its graph is acyclic and singly rooted.
We shall assume that the root is the sort symbol UNIV.
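
The requirement that the subsort graph be acyclic and singly rooted is easy to check mechanically. The Python sketch below assumes each subsort sentence is given as a (subsort, supersort) pair; the function name `check_sort_graph` and the example sorts are hypothetical.

```python
def check_sort_graph(subsorts):
    """subsorts: iterable of (sub, sup) pairs, one per subsort sentence.

    Returns (acyclic, roots), where roots is the set of sorts with no supersort.
    """
    subsorts = list(subsorts)
    nodes = {s for pair in subsorts for s in pair}
    children = {n: set() for n in nodes}
    has_parent = set()
    for sub, sup in subsorts:
        children[sup].add(sub)      # arc directed from the supersort to the subsort
        has_parent.add(sub)
    roots = nodes - has_parent

    # Depth-first search; meeting a node that is still being visited means a cycle.
    unvisited, in_progress, done = 0, 1, 2
    state = {n: unvisited for n in nodes}

    def has_cycle(node):
        state[node] = in_progress
        for child in children[node]:
            if state[child] == in_progress:
                return True
            if state[child] == unvisited and has_cycle(child):
                return True
        state[node] = done
        return False

    acyclic = not any(state[n] == unvisited and has_cycle(n) for n in nodes)
    return acyclic, roots


acyclic, roots = check_sort_graph(
    [("mammal", "univ"), ("dog", "mammal"), ("cat", "mammal")])
```

A sort theory meets the requirement when `acyclic` is true and `roots` contains exactly one sort.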

Notice that every sentence of a sort theory is a notational variant of a definite 
clause. We will sometimes treat these sentences as clauses, such as when we apply 
the resolution rule of inference to them. 

Where $\phi$ is any formula (ordinary or sorted), we say that the universal closure
of $\phi$ is the result of universally quantifying all free variables in $\phi$, and we denote
the universal closure of $\phi$ by $\forall\phi$.

Let $\Sigma$ be a sort theory. We say that sentence $\phi_1$ $\Sigma$-entails sentence $\phi_2$ if
and only if $\Sigma \cup \{\phi_1\} \models \phi_2$, and we write $\phi_1 \models_\Sigma \phi_2$. Two clauses, sorted or
unsorted, are said to be $\Sigma$-equivalent if their universal closures $\Sigma$-entail each
other. A quasi-ordering, or preorder, is a relation that is reflexive and transitive.
It is straightforward to verify that $\models_\Sigma$ is a quasi-ordering on sentences.

Notationally, throughout the paper $\theta$ and $\sigma$ always denote substitutions, $\phi$
always denotes a formula and $\Sigma$ always denotes a sort theory. $C$ and $D$ always
denote sorted clauses unless otherwise stated. To denote a normalised clause,
and are usually used. Sorted and unsorted clauses are treated as sets,
though they are sometimes written as disjunctions. If $C$ is a sorted or unsorted
clause (that is, a set of literals) and $L$ is a literal, then $C \lor L$ is a clause with
literals $C \cup \{L\}$.

3 Sorted Substitution and Sorted Subsumption 

Intuitively, a substitution is a sorted substitution if the term substituted for
each variable respects the sort associated with the variable. More precisely, a
substitution $\theta$ is said to be a sorted substitution or $\Sigma$-substitution if for every
variable $x{:}\tau$, it is the case that $\Sigma \models \forall\tau(t)$ where $t$ is $(x{:}\tau)\theta$. A sorted clause $C$ is
$\Sigma$-more general than another $D$ if there is a $\Sigma$-substitution $\theta$ such that $C\theta = D$.
In this case we write $C \geq_\Sigma D$ and say that $D$ is a $\Sigma$-instance of $C$.
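
For the restricted sort theories used in this paper (function sentences and subsort sentences over an acyclic subsort graph), the semantic condition on a sorted substitution can be decided by simple backward chaining. The Python sketch below illustrates this under our own encoding: terms are tagged tuples, subsort sentences are (sub, sup) pairs, and function sentences are (symbol, argument sorts, result sort) triples. All names and the example sort theory are hypothetical.

```python
def is_subsort(sub, sup, subsorts):
    """True if `sub` equals `sup` or reaches it through subsort sentences."""
    if sub == sup:
        return True
    return any(is_subsort(mid, sup, subsorts)
               for lower, mid in subsorts if lower == sub)


def has_sort(term, sort, subsorts, functions):
    """Check that `term` denotes an object of `sort` under the sort theory.

    term:      ("var", name, sort) or ("fn", symbol, [subterms])
    subsorts:  set of (sub, sup) pairs, one per subsort sentence
    functions: list of (symbol, [argument sorts], result sort), one per
               function sentence; constants have an empty argument list
    """
    if term[0] == "var":
        declared = [term[2]]
    else:
        _, symbol, args = term
        declared = [result for sym, arg_sorts, result in functions
                    if sym == symbol and len(arg_sorts) == len(args)
                    and all(has_sort(a, s, subsorts, functions)
                            for a, s in zip(args, arg_sorts))]
    return any(is_subsort(d, sort, subsorts) for d in declared)


def is_sorted_substitution(theta, subsorts, functions):
    """theta maps (variable name, sort) pairs to terms."""
    return all(has_sort(term, sort, subsorts, functions)
               for (_, sort), term in theta.items())


# Hypothetical sort theory: DOG is a subsort of MAMMAL, which is a subsort of UNIV;
# fido is a DOG, and mother maps MAMMALs to MAMMALs.
subsorts = {("dog", "mammal"), ("mammal", "univ")}
functions = [("fido", [], "dog"), ("mother", ["mammal"], "mammal")]

mother_of_dog = ("fn", "mother", [("var", "y", "dog")])
ok = is_sorted_substitution({("x", "mammal"): mother_of_dog}, subsorts, functions)
bad = is_sorted_substitution({("x", "dog"): mother_of_dog}, subsorts, functions)
```

In this example the first substitution is accepted, because mother(y) is known to be a MAMMAL for any DOG y, and the second is rejected, because nothing in the sort theory makes mother(y) a DOG. Termination of `is_subsort` relies on the subsort graph being acyclic.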

Frisch [2] shows that the identity substitution is a $\Sigma$-substitution and that
the composition of two $\Sigma$-substitutions is also a $\Sigma$-substitution. It follows from
this that $\geq_\Sigma$ is a quasi-ordering.

If $\sigma$ is a renaming substitution then $\theta$ and $\theta \cdot \sigma$ are said to be variants. Also,
$C$ and $C\sigma$ are said to be variants.

The following theorems provide justification for the given definition of
$\Sigma$-substitution by telling us that they have the significant properties that ordinary
substitutions have, hence we hope that this will lead sorted subsumption to have
many of the important properties of ordinary subsumption.

Theorem 1 (Sorted Herbrand Theorem [2]). Let $C$ be a set of sorted
clauses and let $C_{\Sigma gr}$ be a set containing every ground clause that is a $\Sigma$-instance
of some clause in $C$. Then $\Sigma \cup \forall C$ is satisfiable if and only if $C_{\Sigma gr}$ is.

This Herbrand Theorem, like the ordinary Herbrand Theorem, relates the
satisfiability of non-ground clauses to the satisfiability of ground clauses. Notice that
this Herbrand theorem relates $\Sigma \cup \forall C$, which contains a sort theory and sorted
variables, to $C_{\Sigma gr}$, which contains no sort theory and no sorted variables (hence
no sort symbols). Thus, for purposes of satisfiability, all that is relevant about
sorts has been built into the process of taking $\Sigma$-instances.

For ordinary clauses we know that instantiation is a stronger relation than 
entailment, but that over the set of ordinary atoms, the two relations are the 
same. The corresponding result holds for sorted clauses and sorted atoms. 

Theorem 2 ([4]). If $C \geq_\Sigma D$ then $\forall C \models_\Sigma \forall D$. Furthermore, if $C$ and $D$ are
sorted atoms, then the converse holds.

Let us now turn our attention to sorted subsumption and begin by recalling
the definition of ordinary (unsorted) subsumption: unsorted clause $E$ subsumes
unsorted clause $F$, written $E \succeq F$, if and only if $E\theta \subseteq F$ for some substitution
$\theta$. The definition of sorted subsumption is identical, except that only sorted
substitutions are considered: sorted clause $C$ $\Sigma$-subsumes sorted clause $D$, written
$C \succeq_\Sigma D$, if and only if $C\theta \subseteq D$ for some $\Sigma$-substitution $\theta$.

Since both the $\subseteq$ and $\geq_\Sigma$ relations are quasi-orders, it follows that the $\succeq_\Sigma$
relation is also a quasi-order. Corresponding to the well-known result that
subsumption implies entailment, $\Sigma$-subsumption implies $\Sigma$-entailment:

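Theorem 3. If $C \succeq_\Sigma D$ then $\forall C \models_\Sigma \forall D$.

Proof. If $C \succeq_\Sigma D$ then, by definition, $C\theta \subseteq D$ for some $\Sigma$-substitution $\theta$. From
Theorem 2, we have $\forall C \models_\Sigma \forall(C\theta)$. Since $C\theta \subseteq D$, the semantic definition of
disjunction tells us that $\forall(C\theta) \models_\Sigma \forall D$. From the previous two statements and
the transitivity of $\models_\Sigma$, we have that $\forall C \models_\Sigma \forall D$. □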

4 The Building Blocks of Sorted Substitution 

The definition of sorted substitution has a semantic component. This section 
presents two purely-syntactic characterisations. The first characterisation shows 
how sorted substitutions can be made up by composing elementary sorted substi- 
tutions, which themselves are syntactically defined. The second characterisation 
is based on inference rules that operate on normalised clauses. Because these
inference rules have a correspondence to elementary sorted instantiation we call 
them ESI inference rules. After defining elementary substitutions and the ESI 
inference rules, this section proves their correspondence and then proves that 
they provide a sound and complete basis for sorted substitution. Though the 
ESI inference rules are not used in subsequent sections, they are central to this 
section’s proof of the completeness of elementary sorted substitutions and they 
provide insight into the close relationship between elementary sorted substitu- 
tions and well-known inference rules. 

Definition 1 (Elementary $\Sigma$-Substitution). An elementary $\Sigma$-substitution
for $C$ is

1. $\{x{:}\tau \mapsto y{:}\tau\}$, where $x{:}\tau$ and $y{:}\tau$ are distinct variables that occur in $C$.

2. $\{z{:}\tau \mapsto f(x_1{:}\tau_1, \ldots, x_n{:}\tau_n)\}$, where $z{:}\tau$ occurs in $C$ and
$\forall\, \tau_1(x_1) \wedge \cdots \wedge \tau_n(x_n) \rightarrow \tau(f(x_1, \ldots, x_n))$ is a variant of a function sentence in $\Sigma$ such
that $x_1{:}\tau_1, \ldots, x_n{:}\tau_n$ do not occur in $C$.

3. $\{x{:}\tau_1 \mapsto y{:}\tau_2\}$, where $x{:}\tau_1$ occurs in $C$ and $\forall x\, \tau_2(x) \rightarrow \tau_1(x)$ is a variant of
a subsort sentence in $\Sigma$ such that $y{:}\tau_2$ does not occur in $C$.
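
The three cases of Definition 1 can be enumerated mechanically. The following is a minimal sketch, not the paper's implementation: the tuple encodings of sorted variables, function sentences and subsort sentences, the fresh-name scheme `v1, v2, ...`, and the function name `elementary_substitutions` are all assumptions made for illustration. Type 1 bindings are produced once per unordered pair, since the two orientations yield variants of each other.

```python
from itertools import combinations, count


def elementary_substitutions(variables, function_sentences, subsort_sentences):
    """Enumerate elementary substitutions for a clause (illustrative encoding).

    variables          -- sorted variables occurring in C, as (name, sort) pairs
    function_sentences -- (f, [arg_sort, ...], result_sort) triples
    subsort_sentences  -- (sub, sup) pairs, read as "forall x. sub(x) -> sup(x)"
    Returns a list of (variable, replacement_term) bindings.
    """
    used = {name for name, _ in variables}
    ids = count(1)

    def fresh():
        # Pick a variable name that does not occur in C.
        while True:
            name = f"v{next(ids)}"
            if name not in used:
                used.add(name)
                return name

    result = []
    # Type 1: collapse two distinct variables of the same sort.
    for (x, sx), (y, sy) in combinations(variables, 2):
        if sx == sy:
            result.append(((x, sx), ("var", y, sy)))
    # Type 2: replace z:tau by f applied to fresh sorted variables,
    # for every function sentence whose result sort is tau.
    for z, sz in variables:
        for f, arg_sorts, res in function_sentences:
            if res == sz:
                args = tuple(("var", fresh(), s) for s in arg_sorts)
                result.append(((z, sz), ("app", f, args)))
    # Type 3: replace x:tau1 by a fresh variable of a declared subsort tau2.
    for x, s1 in variables:
        for sub, sup in subsort_sentences:
            if sup == s1:
                result.append(((x, s1), ("var", fresh(), sub)))
    return result


# Hypothetical example data, not taken from the paper.
EXAMPLE_VARS = [("x", "int"), ("y", "int"), ("z", "odd")]
EXAMPLE_FUNS = [("succ", ["int"], "int")]
EXAMPLE_SUBS = [("odd", "int")]
EXAMPLE_RESULT = elementary_substitutions(EXAMPLE_VARS, EXAMPLE_FUNS, EXAMPLE_SUBS)
```

Because every binding is read off the syntax of the sort theory, no entailment test is needed at this stage, which is the point of the syntactic characterisation.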

Definition 2 (ESI Inference Rules and Derivations). The ESI inference
rules are:

1. Binary sort factorisation: Let $C^N$ be a normalised clause whose form is
$\neg\tau_1(x_1) \vee \cdots \vee \neg\tau_n(x_n) \vee C'$, where $C'$ is a clause containing only
ordinary predicate symbols. For all $1 \leq i, j \leq n$ such that $\tau_i = \tau_j$ we say that
$C^N\{x_i \mapsto x_j\}$ is a binary sort factor of $C^N$.

2. Binary f-resolution: If $C^N$ is a normalised clause and $\phi$ is a function
sentence, then any binary resolvent of $C^N$ and $\phi$ is a binary f-resolvent of $C^N$
and $\phi$.

3. Binary ss-resolution: If $C^N$ is a normalised clause and $\phi$ is a subsort
sentence, then any binary resolvent of $C^N$ and $\phi$ is a binary ss-resolvent of $C^N$
and $\phi$.

A finite sequence of unsorted clauses, $C_0, \ldots, C_n$, is an ESI-derivation of $C_n$
from $\{C_0\} \cup \Sigma$ if for all $1 \leq i \leq n$, $C_i$ can be obtained by applying an ESI
inference rule to $C_{i-1}$ and a sentence in $\Sigma$.

Proposition 1. The resolvent of a normalised clause with a clause from a sort
theory is always a normalised clause. Furthermore, an ESI inference rule applied
to a normalised clause always produces a normalised clause.

The operation of the three types of elementary substitutions in the domain of 
sorted clauses corresponds directly to the operation of the three inference rules 
in the domain of normalised clauses. 



Theorem 4 (Correspondence Theorem). Let $C$ and $D$ be sorted clauses.
Then:

1. $D = C\theta$ for some $\theta$, a type 1 elementary $\Sigma$-substitution for $C$, if and only
if $D^N$ is a binary sorted factor of $C^N$.

2. $D$ is a variant of $C\theta$ for some $\theta$, a type 2 elementary $\Sigma$-substitution for $C$, if
and only if $D^N$ is a variant of a binary f-resolvent of $C^N$ and some function
sentence of $\Sigma$.

3. $D$ is a variant of $C\theta$ for some $\theta$, a type 3 elementary $\Sigma$-substitution for $C$, if
and only if $D^N$ is a variant of a binary ss-resolvent of $C^N$ and some subsort
sentence of $\Sigma$.

Proof. Let $y_1{:}\tau_1, \ldots, y_n{:}\tau_n$ be the variables of $C$ and let $C^N$ be
$\neg\tau_1(y_1) \vee \cdots \vee \neg\tau_n(y_n) \vee C'$.

1. Distinct variables $x{:}\tau$ and $y{:}\tau$ occur in $C$ if and only if $\tau(x)$ and $\tau(y)$ are
distinct atoms that appear in $C^N$. Thus $\theta = \{x{:}\tau \mapsto y{:}\tau\}$ is a type 1
elementary $\Sigma$-substitution for $C$ if and only if $C^N$ has a binary sort factor
using substitution $\theta' = \{x \mapsto y\}$. The reader can verify that this sort factor
is $(C\theta)^N$.

2. For each function sentence $\phi \in \Sigma$ we show the correspondence holds between
the type 2 elementary $\Sigma$-substitutions using $\phi$ and the binary f-resolutions
using $\phi$. Let $\phi' = \forall\, \tau_1(x_1) \wedge \cdots \wedge \tau_n(x_n) \rightarrow \tau(f(x_1, \ldots, x_n))$ be a variant of
$\phi$ such that $x_1{:}\tau_1, \ldots, x_n{:}\tau_n$ do not occur in $C$. Then $z{:}\tau$ occurs in $C$ if and
only if $\tau(z)$ occurs in $C^N$. Thus $\theta = \{z{:}\tau \mapsto f(x_1{:}\tau_1, \ldots, x_n{:}\tau_n)\}$ is a type 2
elementary $\Sigma$-substitution for $C$ if and only if there is a binary f-resolution
of $\phi'$ and $C^N$ upon the atom $\tau(z)$. The reader can verify that the resolvent
is $(C\theta)^N$.

3. For each subsort sentence $\phi \in \Sigma$ we show the correspondence holds between
the type 3 elementary $\Sigma$-substitutions using $\phi$ and the binary ss-resolutions
using $\phi$. Let $\phi' = \forall x\, \tau_2(x) \rightarrow \tau_1(x)$ be a variant of $\phi$ such that $x{:}\tau_2$ does not
occur in $C$. Then $x{:}\tau_1$ occurs in $C$ if and only if $\neg\tau_1(x)$ occurs in $C^N$. Thus
$\theta = \{x{:}\tau_1 \mapsto y{:}\tau_2\}$ is a type 3 elementary $\Sigma$-substitution for $C$ if and only if
there is a binary ss-resolution of $\phi'$ and $C^N$ upon the atom $\tau_1(x)$. The reader
can verify that the resolvent is $(C\theta)^N$. □

The soundness of elementary $\Sigma$-substitutions follows directly from the
definitions of $\Sigma$-substitution and elementary $\Sigma$-substitution. Then, the soundness
of ESI-derivations follows immediately from the soundness of elementary
substitutions and the Correspondence Theorem (Theorem 4).

Theorem 5 (Soundness of Elementary $\Sigma$-Substitutions). Every elementary
$\Sigma$-substitution is a $\Sigma$-substitution.

Corollary 1 (Soundness of ESI-Derivations). $D$ is a $\Sigma$-instance of $C$ if
there is an ESI inference rule that in one step derives $D^N$ from $\{C^N\} \cup \Sigma$.

We now turn our attention to the task of proving the completeness of ESI- 
derivations and, hence, also elementary substitutions. We first introduce some 
useful notation and concepts and prove two lemmas before turning to the main 
proof. 







Let $\phi$ be any formula with $n$ top-level term occurrences, that is, $n$ occurrences
of terms as arguments to predicates. Number the occurrences from 1 to $n$ in
left-to-right order, as they appear in $\phi$. We use $\phi[t_1, \ldots, t_n]$ to denote a formula
whose $i$th top-level term occurrence is $t_i$, for all $1 \leq i \leq n$. Subsequently we use
$\phi[t'_1, \ldots, t'_n]$ to denote the formula that results from replacing the $i$th top-level
term occurrence in $\phi[t_1, \ldots, t_n]$ with $t'_i$, for all $1 \leq i \leq n$.

We establish the completeness of ESI-derivations by showing their corre- 
spondence to SLD-style derivations. The SLD derivations we use are of the more 
general form presented by Nienhuys-Cheng and De Wolf [6]. In this form the top 
clause and all derived clauses may be any Horn clause, though the side clauses 
must still be definite and resolution must still be upon the positive literal of the 
side clause. In this general setting, the subsumption theorem does not hold if a 
fixed selection function is used;¹ so we shall ignore selection functions and, to
avoid confusion, call such derivations LD-derivations. Thus, in an LD derivation 
any negative literal of a center clause may be resolved upon. 

Lemma 1. If $C$ and $D$ are sorted atoms such that $C^N \succeq D^N$ then some variant
of $D^N$ can be derived from $C^N$ using (zero or more applications of) only sorted
binary factoring.

Proof. Let $C^N$ be $\neg\tau_1(x_1) \vee \cdots \vee \neg\tau_n(x_n) \vee Head$. Then, since $D$ is atomic,
$D^N$ can be written as

$$T \vee \neg\tau_1(x_1\theta) \vee \cdots \vee \neg\tau_n(x_n\theta) \vee Head\,\theta, \qquad (1)$$

where $\theta$ is a substitution and $T$ is a possibly empty disjunction of negative sort
literals, none of which are $\neg\tau_i(x_i\theta)$ ($1 \leq i \leq n$). Then $\theta$ must map every $x_i$
($1 \leq i \leq n$) to a variable, otherwise (1) would not be a normalised clause. Thus,
either $\theta$ is a renaming substitution, or it "collapses" two or more variables in
the sense that it maps them to the same variable. Whatever collapses it does
can be considered as a sequence of collapses each of which collapses exactly two
distinct variables. Consider a collapse of two distinct variables, $x_i$ and $x_j$. Then
$\tau_i$ and $\tau_j$ must be the same, otherwise (1) would not be a normalised clause. Now
observe that this collapse of $x_i$ and $x_j$ is a variant of the collapse produced by
applying binary sorted factorisation to the literals $\neg\tau_i(x_i)$ and $\neg\tau_j(x_j)$. Hence
the sequence of collapses can be produced by a sequence of applications of binary
sorted factorisations, and the effect of applying $\theta$ to $C^N$ can be achieved by a
series of zero or more binary sorted factorings followed by the application of
a renaming substitution. By Proposition 1, $C^N\theta$ must be a normalised clause;
therefore $T$ must be empty, otherwise (1) would not be a normalised clause. Thus
(1) can be produced from $C^N$ by application of a series of zero or more binary
sorted factorings followed by the application of a renaming substitution. □

Lemma 2. Let $C$ and $D$ be sorted atoms. Then every LD derivation of $D^N$
from $\{C^N\} \cup \Sigma$ is an ESI derivation of $D^N$ from $\{C^N\} \cup \Sigma$.

Proof. Note that $C^N$ and all members of $\Sigma$ are definite clauses. Let $P$
be the predicate symbol in the positive literal of $C^N$. Also note that in an LD
derivation the positive literals of all derived clauses and of the top clause are
all the same. Thus the top clause must be $C^N$, as no clause in $\Sigma$ contains $P$.
And all of the other input clauses to the derivation must be members of $\Sigma$.
Hence every resolution in the LD derivation is a binary ss-resolution or a binary
f-resolution. □

¹ This point is not stated explicitly by Nienhuys-Cheng and De Wolf.

Theorem 6 (Completeness of ESI-Derivations). If $C \geq_\Sigma D$ then there is
an ESI derivation of a variant of $D^N$ from $\{C^N\} \cup \Sigma$.

Proof. Let us write $C$ as a disjunction of distinct sorted literals, $C_1 \vee \cdots \vee C_n$,
which we shall call $C'[t_1, \ldots, t_k]$. Then $C \geq_\Sigma D$ if and only if there exists a
disjunction, $D_1 \vee \cdots \vee D_n$, which we shall call $D'[t'_1, \ldots, t'_k]$, and a $\Sigma$-substitution,
$\theta$, such that $D = \{D_i \mid 1 \leq i \leq n\}$ and $C_i\theta = D_i$ for all $1 \leq i \leq n$. Let
$P$ be a predicate symbol of arity $k$ that does not appear in $C$, $D$ or $\Sigma$.
Observe that $D'[t'_1, \ldots, t'_k] = C'[t_1\theta, \ldots, t_k\theta]$. Thus $P(t_1, \ldots, t_k)\theta = P(t'_1, \ldots, t'_k)$,
which, from the work of Frisch and Page [4], implies that $\forall P(t_1, \ldots, t_k) \models_\Sigma \forall P(t'_1, \ldots, t'_k)$.
This, in turn, implies that $\forall P(t_1, \ldots, t_k)^N \models_\Sigma \forall P(t'_1, \ldots, t'_k)^N$.
Since $\Sigma$, $P(t_1, \ldots, t_k)^N$, and $P(t'_1, \ldots, t'_k)^N$ are all Horn clauses, the
subsumption theorem for LD-derivations [6] tells us that there is a definite clause $A$ that
is LD-derivable from $\{P(t_1, \ldots, t_k)^N\} \cup \Sigma$ such that $A$ subsumes $P(t'_1, \ldots, t'_k)^N$.
From Lemmas 1 and 2 we know that this implies that there is a definite clause $A$
that is ESI-derivable from $\{P(t_1, \ldots, t_k)^N\} \cup \Sigma$ and a variant of $P(t'_1, \ldots, t'_k)^N$
that is ESI-derivable from $\{A\} \cup \Sigma$. Hence, there is an ESI derivation of a variant
of $P(t'_1, \ldots, t'_k)^N$ from $\{P(t_1, \ldots, t_k)^N\} \cup \Sigma$. Now let us replace each clause in
this derivation, which is of the form $P(s_1, \ldots, s_k)^N$, with $C'[s_1, \ldots, s_k]^N$. The
resulting derivation is the desired ESI derivation, from $\{C^N\} \cup \Sigma$ to a variant
of $D^N$. □

The completeness of elementary substitutions follows immediately from the
completeness of ESI-derivations and the Correspondence Theorem (Theorem 4).

Corollary 2 (Completeness of Elementary Substitutions). If $C \geq_\Sigma D$
then there exists a finite sequence of clauses, $C_0, \ldots, C_n$, such that $C_0 = C$,
clause $C_n$ is a variant of $D$, and for every $1 \leq i \leq n$ there is some $\sigma$, an
elementary $\Sigma$-substitution for $C_{i-1}$, such that $C_{i-1}\sigma = C_i$.

5 Sorted Downward Refinement 

A downward refinement operator in general is used to generate specialisations of
a formula; the particular downward operator introduced by this paper operates
on sorted clauses and it produces sorted clauses that are specialisations in the
sense that they are $\Sigma$-subsumed by the clause operated on.

Starting with a set of one or more initial clauses, a refinement operator can
be used repeatedly to generate more clauses to add to the set. This defines a set
of clauses, called a refinement space, that a learning algorithm can search in an
attempt to find a suitable hypothesis.

An important question that needs to be addressed is what formulas would 
we like to be in the refinement space. That is, what formulas make reasonable 
hypotheses? In the unsorted setting, a common and simple answer to this ques- 
tion is that the space should include every clause subsumed by an initial clause. 
But in the case of sorted clauses the question is trickier because not every clause 
is, in a certain sense, sensible. The answer we shall adopt here is that clauses in 
the refinement space should contain only terms that conform to the sort theory 
in the sense that they are well sorted, as we now define.

Definition 3 (Well Sorted Terms and Clauses). Term $t$ has sort $\tau$ with
respect to $\Sigma$ if $\Sigma \models \forall \tau(t)$. In addition, $t$ is well sorted with respect to $\Sigma$ if for
some sort symbol $\tau$ term $t$ has sort $\tau$ with respect to $\Sigma$. By extension, we say
that a clause is well sorted with respect to $\Sigma$ if every term that occurs in it is
well sorted with respect to $\Sigma$. Where $\Sigma$ is obvious, we shall simply say "well
sorted."

A number of simple consequences of this definition are worth noting, and we 
do so without proof: 

Proposition 2. If $\Sigma$ is an arbitrary sort theory then:

1. The empty clause is well sorted with respect to $\Sigma$.

2. With respect to $\Sigma$, a variable, $x{:}\tau$, has sort $\tau$ and is therefore well sorted.

3. $f(t_1, \ldots, t_n)$ has sort $\tau$ with respect to $\Sigma$ if and only if $\Sigma$ entails some
function sentence of the form $\forall x_1, \ldots, x_n\; \tau_1(x_1) \wedge \cdots \wedge \tau_n(x_n) \rightarrow \tau(f(x_1, \ldots, x_n))$
such that each $t_i$ has sort $\tau_i$, for $1 \leq i \leq n$.

4. If $f(t_1, \ldots, t_n)$ is well sorted with respect to $\Sigma$, then so are $t_1, \ldots, t_n$.

Point 4, which is an immediate consequence of point 3, is useful for simplifying
the test for well sortedness.
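
Points 2 to 4 suggest a bottom-up sort computation. The sketch below is illustrative only: the term encoding, the example sort theory and the names `sorts_of` and `is_well_sorted` are assumptions, and the test for "Σ entails some function sentence" is approximated by looking up the function sentences listed explicitly and closing their result sorts under the listed subsort sentences.

```python
def supersorts(sort, subsort_sentences):
    """Return `sort` together with every sort reachable through subsort sentences."""
    seen, stack = {sort}, [sort]
    while stack:
        current = stack.pop()
        for sub, sup in subsort_sentences:
            if sub == current and sup not in seen:
                seen.add(sup)
                stack.append(sup)
    return seen


def sorts_of(term, function_sentences, subsort_sentences):
    """Set of sorts the term is found to have (empty set: not well sorted).

    Terms are ("var", name, sort) or ("app", f, (arg, ...)); constants are
    applications with no arguments.
    """
    if term[0] == "var":
        # Point 2: a variable x:tau has sort tau (and its supersorts).
        return supersorts(term[2], subsort_sentences)
    _, f, args = term
    arg_sorts = [sorts_of(a, function_sentences, subsort_sentences) for a in args]
    found = set()
    # Point 3: use a function sentence whose argument sorts are all satisfied.
    for g, wanted, result in function_sentences:
        if g == f and len(wanted) == len(args) and all(
            w in have for w, have in zip(wanted, arg_sorts)
        ):
            found |= supersorts(result, subsort_sentences)
    return found


def is_well_sorted(term, function_sentences, subsort_sentences):
    # By point 4, a sorted compound term already has well sorted arguments,
    # so a single check at the top suffices.
    return bool(sorts_of(term, function_sentences, subsort_sentences))


# Hypothetical sort theory, not taken from the paper.
FUNS = [("zero", [], "even"), ("succ", ["even"], "odd"), ("succ", ["odd"], "even")]
SUBS = [("even", "int"), ("odd", "int")]
ZERO = ("app", "zero", ())
ONE = ("app", "succ", (ZERO,))
TWO = ("app", "succ", (ONE,))
```

With this theory, `sorts_of(TWO, FUNS, SUBS)` contains `even` and `int`, while a term built from an undeclared function symbol has no sort and is therefore not well sorted.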

More importantly, the set of well-sorted formulas has the property that it is 
closed under the application of sorted substitutions. 

Theorem 7. If $t$ is a term that has sort $\tau$ with respect to $\Sigma$ and $\theta$ is a
$\Sigma$-substitution then $t\theta$ has sort $\tau$ with respect to $\Sigma$. Thus if $e$ is a term or clause
that is well sorted with respect to $\Sigma$, then so is $e\theta$.

Proof. This is proved by induction on the structure of $t$. In the base case, $t$ is a
constant or a variable. In the case in which $t$ is a constant, the theorem holds
trivially since $t = t\theta$. In the case in which $t$ is a variable, let $x{:}\omega$ be $t$ and let
$s$ be $(x{:}\omega)\theta$. Since $\theta$ is a $\Sigma$-substitution, by definition $\Sigma \models \forall \omega(s)$. Thus, by
Definition 3, $s$, which is equal to $t\theta$, has sort $\tau$ with respect to $\Sigma$.

For the inductive case, let $t$ be $f(t_1, \ldots, t_n)$ and assume (inductive
hypothesis) that the theorem holds for $t_1, \ldots, t_n$. By point 3 of Proposition 2, $\Sigma$
entails some function sentence $\phi$ of the form $\forall x_1, \ldots, x_n\; \tau_1(x_1) \wedge \cdots \wedge \tau_n(x_n) \rightarrow \tau(f(x_1, \ldots, x_n))$
such that each $t_i$ has sort $\tau_i$, for $1 \leq i \leq n$. Then by the
inductive hypothesis each $t_i\theta$ has sort $\tau_i$, for $1 \leq i \leq n$. This means that
$\Sigma$ entails $\forall\, \tau_1(t_1\theta) \wedge \cdots \wedge \tau_n(t_n\theta)$. And since $\Sigma$ entails $\phi$ it must also entail
$\forall\, \tau(f(t_1\theta, \ldots, t_n\theta))$. Thus, by Definition 3, $t\theta$, which is $f(t_1\theta, \ldots, t_n\theta)$, has
sort $\tau$. □

Consequently, when a refinement operator generates a new term by applying a
$\Sigma$-substitution to a well sorted term, there is no need to test that the resulting
term is well sorted.

Definition 4 (Sorted Downward Refinement). If $C$ is a sorted clause, its
downward $\Sigma$-refinement, written $\rho_\Sigma(C)$, is the smallest set such that:

1. For each $\theta$ that is an elementary $\Sigma$-substitution for $C$, $\rho_\Sigma(C)$ contains $C\theta$.

2. For each $n$-ary predicate symbol $P$, let $x_1{:}\mathrm{UNIV}, \ldots, x_n{:}\mathrm{UNIV}$ be distinct
variables not appearing in $C$. Then $\rho_\Sigma(C)$ contains $C \vee P(x_1{:}\mathrm{UNIV}, \ldots, x_n{:}\mathrm{UNIV})$
and $C \vee \neg P(x_1{:}\mathrm{UNIV}, \ldots, x_n{:}\mathrm{UNIV})$.

$\rho^*_\Sigma(C)$ is the smallest set such that $C \in \rho^*_\Sigma(C)$ and if $D \in \rho^*_\Sigma(C)$ then
$\rho_\Sigma(D) \subseteq \rho^*_\Sigma(C)$.

We now present a simple lemma and then use it to show that this downward 
refinement operator has the desired properties. 

Lemma 3. If $C \geq_\Sigma D$ then $\rho^*_\Sigma(C)$ contains a variant of $D$.

Proof. Since $\rho_\Sigma$ can apply any elementary $\Sigma$-substitution, this lemma is an
immediate consequence of Corollary 2. □

Theorem 8 (Sorted Downward Refinement Theorem). $\rho_\Sigma$ is

- Correct: If $D \in \rho_\Sigma(C)$ then $C \succeq_\Sigma D$ and if $C$ is well sorted with respect to
$\Sigma$ then so is $D$;

- Finite: If the set of predicate symbols in the language is finite then $\rho_\Sigma(C)$ is
finite; and

- Complete: If $C$ and $D$ are well sorted clauses with respect to $\Sigma$ and $C \succeq_\Sigma D$
then some variant of $D$ is a member of $\rho^*_\Sigma(C)$.

Proof. Correct: Consider as two cases the two forms of downward refinement
given in the definition of downward refinement (Definition 4). If $D$ is generated
in case 1 then $D = C\theta$ for some elementary $\Sigma$-substitution $\theta$. Hence, by the
soundness of elementary sorted substitutions (Theorem 5), $C \succeq_\Sigma D$ and, by
Theorem 7, if $C$ is well sorted with respect to $\Sigma$ then so is $D$.

If $D$ is generated in case 2, then $C \subseteq D$ so, by definition, $C \succeq_\Sigma D$. If $C$ is
well sorted with respect to $\Sigma$ then so is $D$ since every term in $D$ is either in $C$
or is of the form $x{:}\mathrm{UNIV}$, which is well sorted by point 2 of Proposition 2.

Finite: Let $v$ be the number of variables occurring in $C$; $s$ be the number
of subsort sentences in $\Sigma$; $f$ be the number of function sentences in $\Sigma$; and $p$
be the number of predicate symbols in the language. Case 1 of the definition
of downward refinement produces a refinement by applying an elementary
$\Sigma$-substitution to $C$. The number of elementary $\Sigma$-substitutions for $C$ of type 1,
2 and 3 is bounded by $v(v-1)/2$, $f \cdot v$ and $s \cdot v$ respectively. The number of
refinements produced by case 2 of the definition of downward refinement is $2p$.
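
As a quick arithmetic check of the finiteness argument, the four bounds can be summed. The function name `refinement_bound` and the sample numbers are illustrative assumptions, not part of the paper.

```python
def refinement_bound(v, f, s, p):
    """Upper bound on the number of one-step downward refinements of a clause.

    v -- variables occurring in the clause
    f -- function sentences in the sort theory
    s -- subsort sentences in the sort theory
    p -- predicate symbols in the language
    """
    type1 = v * (v - 1) // 2   # unordered pairs of variables to collapse
    type2 = f * v              # one candidate per (function sentence, variable)
    type3 = s * v              # one candidate per (subsort sentence, variable)
    new_literals = 2 * p       # a positive and a negative literal per predicate
    return type1 + type2 + type3 + new_literals


# Hypothetical sizes: 3 variables, 2 function sentences, 1 subsort sentence,
# 4 predicate symbols give 3 + 6 + 3 + 8 = 20.
EXAMPLE_BOUND = refinement_bound(3, 2, 1, 4)
```

The bound is quadratic in the number of variables and linear in the size of the sort theory and of the predicate vocabulary.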

Complete: Let $\theta$ be a $\Sigma$-substitution such that $C\theta \subseteq D$. From Lemma 3 we
know that $C\theta$ is a variant of a member of $\rho^*_\Sigma(C)$. Below we shall show that $D$
is a variant of a member of $\rho^*_\Sigma(C\theta)$. From these two facts we can conclude that
$D$ is a variant of a member of $\rho^*_\Sigma(C)$.

To show that $D$ is a variant of a member of $\rho^*_\Sigma(C\theta)$ we show that $\rho^*_\Sigma$ can
extend a clause by adding any one well sorted literal; hence it can extend a clause
by adding any finite number of well sorted literals. More precisely we show that
if $A$ is a well sorted clause and $L$ is a well sorted literal then $A \vee L$ is a variant
of a member of $\rho^*_\Sigma(A)$. Let $L$ be of the form $P(t_1, \ldots, t_n)$ (or $\neg P(t_1, \ldots, t_n)$)
and let $L'$ be $P(x_1{:}\mathrm{UNIV}, \ldots, x_n{:}\mathrm{UNIV})$ (or $\neg P(x_1{:}\mathrm{UNIV}, \ldots, x_n{:}\mathrm{UNIV})$), where
$x_1, \ldots, x_n$ are variable names that do not occur in $A$. Observe that $L'$, and
hence $A \vee L'$, is well sorted with respect to $\Sigma$ and that $A \vee L$ is a $\Sigma$-instance of
$A \vee L'$. Hence, from Lemma 3, $A \vee L$ is a variant of a member of $\rho^*_\Sigma(A \vee L')$. From
case 2 of the definition of downward refinement (Definition 4), $A \vee L' \in \rho_\Sigma(A)$.
From these last two sentences we reach our desired conclusion: $A \vee L$ is a variant
of a member of $\rho^*_\Sigma(A)$. □



6 A Comparison of Sorted and Unsorted Refinement 



This section compares the sorted downward refinement operator $\rho_\Sigma$ defined in
this paper with the unsorted downward refinement operator $\rho_L$ presented by
Nienhuys-Cheng and De Wolf [6] and credited to Laird [5]. This comparison is
conducted by considering $\rho_\Sigma$ as operating on the set of normalised clauses (to be
precise, $\rho_\Sigma(C^N) = \{D^N \mid D \in \rho_\Sigma(C)\}$) and $\rho_L$ as operating on unsorted clauses
whose predicate symbols are either ordinary predicate symbols or sort symbols.

In a couple of ways $\rho_\Sigma$ generates a smaller space than does $\rho_L$. If $C$ is a well
sorted clause, then every member of $\rho_\Sigma(C^N)$ is the normalisation of a well sorted
clause. However this is not the case for $\rho_L$: $\rho_L(C^N)$ may contain clauses that are
the normalisation of no sorted clause or of a sorted clause that is not well sorted.
Furthermore, it can be shown that every member of $\rho_\Sigma(C^N)$ is $\Sigma$-equivalent to
some member of $\rho_L(C^N)$. However the containment doesn't hold the other way.

Another difference between the two refinement operators is that $\rho_\Sigma$ takes
account of $\Sigma$ in the order that it generates clauses, whereas $\rho_L$ does not. To
illustrate this let $\Sigma$ be $\{\forall x\; \mathrm{ODD}(x) \rightarrow \mathrm{INTEGER}(x)\}$, let $C^N$ be $P(x) \vee \neg\mathrm{INTEGER}(x)$,
let $D^N$ be $P(x) \vee \neg\mathrm{ODD}(x)$, and let $D'$ be $P(x) \vee \neg\mathrm{INTEGER}(x) \vee \neg\mathrm{ODD}(x)$.
Observe that $\rho^*_\Sigma(C^N)$ contains $D^N$ but $\rho^*_L(C^N)$ does not contain $D^N$, though it
does contain the $\Sigma$-equivalent clause $D'$. Thus the $\rho_L$ space fails to recognise the
equivalence of $D^N$ and $D'$ and locates them in different places in the refinement
space.

7 Further Work and Conclusions

Of the numerous issues that require and deserve further study, two seem more 
immediate than the others. First, the relationship between the sorted subsump- 
tion relation and Buntine’s [1] generalised subsumption relation is a very 
close one and deserves explication. Second, in certain cases some of the
immediate sorted refinements of a sorted clause may be $\Sigma$-instances of others. This
redundancy is undesirable and methods for eliminating it need to be investigated.

To conclude the reader is asked to observe the degree to which this paper’s 
development of sorted subsumption and sorted refinement parallels that of the 
unsorted case except with sorted substitution replacing unsorted substitution. 
Though proving the completeness of elementary sorted substitutions requires 
a fair amount of new mechanism, this paper’s approach to sorted refinement 
presents no surprises. And that is the point. One can build background knowledge 
into subsumption and refinement by building it into substitution — and nowhere 
else. And, unlike other routes, the approach to doing so is straightforward. 

Abstract. A new IFLP schema is presented as a general framework for
the induction of functional logic programs (FLP). Since narrowing (which 
is the most usual operational semantics of FLP) performs a unification 
(mgu) followed by a replacement, we introduce two main operators in 
our IFLP schema: a generalisation and an inverse replacement or intra- 
replacement, which results in a generic inversion of the transitive prop- 
erty of equality. We prove that this schema is strong complete in the way 
that, given some evidence, it is possible to induce any program which 
could have generated that evidence. We outline some possible restric- 
tions in order to improve the tractability of the schema. We also show 
that inverse narrowing is just a special case of our IFLP schema. Finally, 
a straightforward extension of the IFLP schema to function invention is 
illustrated. 

Keywords: Functional Logic Programming, Inductive Logic Program- 
ming, Function Invention, Induction of Auxiliary Functions, Narrowing, 
Inverse Narrowing. 



1 Introduction 

Inductive logic programming (ILP) [9] is the branch of machine learning that 
studies concept learning in a logical framework. Namely, ILP deals with the 
induction of logic programs (i.e. finite sets of Horn clauses) from examples and 
background knowledge. 

The use of logic programming for learning is mainly based on the idea that 
logic programs are a single representation for examples, background knowledge 
and hypotheses. However, logic languages like Prolog (the most representative 
language of this paradigm) lack some programming facilities such as evaluable 
and nested functions, types, higher order programming and lazy evaluation. Al- 
though these features are well supported by functional languages, they lack the 
computing power provided by logical variables and unification. Hence, the inter- 
est in the integration of both families of languages has grown over the last few 
years. 

Integrated languages fully exploit the facilities of logic programming in a 
general sense: functions, predicates and equality. One relevant approach [4, 6] 
to integration is functional logic programming where the programs are logic
programs which are augmented with Horn equational theories. A lot of work
has been invested in the development of the semantics of integrated languages.
As a result, it has been shown that the main semantic properties of logic
programs also hold for functional logic programs (least model, fixpoint semantics)
[1]¹. Operational semantics is defined in terms of semantic unification or
$\mathcal{E}$-unification [15] (i.e., general unification wrt an equational theory $\mathcal{E}$). Narrowing
[5, 14] is a sound and complete $\mathcal{E}$-unification method for theories which satisfy
some requirements (such as confluence and termination properties or the absence 
of extra variables in the condition of the equations). Narrowing can be seen as 
a combination of resolution from logic programming and term reduction from 
functional programming. Hence, it is widely accepted that narrowing is the key 
to describing operational semantics of functional logic languages. 

In [3] we have presented a framework for the induction of functional logic 
programs (IFLP) from (positive and negative) examples. The evidence is com- 
posed of equations, and their rhs’s are normalised wrt the background knowledge 
and the theory to be induced. In logic programming, the induction can be made 
top-down (starting from the most general program and refining it by
specialisation) or bottom-up (starting from positive data as a program and generalising
it). In the case of functional logic programs, we cannot follow a top-down
direction because the examples are equations, and the most general program $X = Y$
would make the program neither terminating nor confluent. As a consequence, the
kernel of our method was an inverse narrowing mechanism (similar to the in- 
verse resolution operator of ILP) which selects pairs of equations to obtain an 
equation which is usually more general than the original ones. The starting set of 
equations is a generalisation of the positive examples which is made by replacing 
terms by variables at some occurrences. In fact, the algorithm combines inverse 
narrowing and generalisation in each step. The method is effective, but it is too 
specific for those cases where auxiliary terms are involved. 

Let us show this with an example. 

Example 1. Consider the following evidence

$$E^+ = \{\; e_1^+ : f(a) = r(g(b,b)), \quad e_2^+ : h(a,b) = r(a) \;\}$$

and suppose that sufficient negative examples are provided to justify the program

$$P = \{\; r_1 : h(Y,a) = g(Y,b), \quad r_2 : f(X) = r(h(b,X)), \quad r_3 : h(a,b) = r(a) \;\}$$

However, $P$ could never be induced by inverse narrowing. This is because
the example $e_1^+$ directly relates the function symbols $f$, $r$ and $g$ (we are not
considering other constant symbols in the equations), whereas the equations $r_1$
and $r_2$ from $P$ define the function $f$ in terms of $r$ and $g$ but through the function
$h$. This last function can be thought of as an auxiliary function in the definition
of $f$. The generalisation step in the inverse narrowing approach does not take
this possibility into account since there is no positive evidence that links the
symbols $f$ and $h$ nor the symbols $h$ and $g$.

¹ In this paper we do not address any questions related to declarative semantics.

In this paper, we define a new framework, the IFLP schema, as a general and 
strong complete framework for solving the IFLP problem. By strong complete- 
ness we refer to the capability of inducing all possible programs such that the 
positive examples hold wrt them but the negative examples do not. The term 
‘strong’ is due to the fact that, in this context, weak completeness makes no 
sense since it is always possible to find a program that covers all the positive 
examples and none of the negative ones: the positive examples themselves. Other 
completeness results could be stated in terms of some extra conditions that the 
program should follow (e.g. Progol). The idea is to generalise the way in which 
the narrowing relation is inverted to induce theories which use auxiliary func- 
tions. The inductive method proposed is closely related to the transitive property 
of equality. More exactly, we define a new operator that reverses the direction in 
which transitivity is applied. Then, we prove that the schema is complete in the 
sense mentioned above. We also show that the IFLP schema is general enough to
have inverse narrowing as one of its instances. Finally, we deal with the function
invention problem which can be easily formalised in our schema. In this context, 
we can consider an invented function as an auxiliary function of a new signature 
that extends the hypothesis language with new functions. 

The work is organised as follows. In Section 2, we recall the main concepts 
of functional logic programming and we formalise the narrowing semantics we 
focus on. Section 3 reviews the inverse narrowing approach and analyses the way
in which theories are induced. This motivates the introduction of new operators
to overcome the limitations of inverse narrowing. The IFLP schema is defined 
in Section 4. The strong completeness of the schema is discussed in Section 5. 
Section 6 shows that inverse narrowing is an instance of our schema. In Section 
7, the setting is easily changed to include function invention. Finally, Section 8 
concludes the paper and discusses future work. 



2 Preliminaries 

We briefly review some basic concepts about equations. Term Rewriting Systems 
and ^-unification. For any concept which is not explicitly defined, the reader may 
refer to [2, 8, 15]. 

Let be a set oi function symbols (or functors) together with their arity^ and 
let T be a countably infinite set of variables. Then T {S, T) denotes the set of 
terms built from S and X. The set of variables occurring in a term t is denoted 
Var(t). This notation naturally extends to other syntactic objects (like clause, 
literal, . . .). A term t is a ground term if Var(t) = 0. A substitution is defined as a 
mapping from the set of variables X into the set of terms T{S, X). An occurrence 
M in a term t is represented by a sequence of natural numbers. 0(t) and 0{t) 



■ We assume that S contains at least one constant. 
denote the set of occurrences and non-variable occurrences of t respectively.
$t|_u$ denotes the subterm of t at the occurrence u and $t[t']_u$ denotes the
replacement of the subterm of t at the occurrence u by the term t'. An equation
is an expression of the form l = r where l and r are terms. l is called the left
hand side (lhs) of the equation and r is the right hand side (rhs). An equational
theory $\mathcal{E}$ (which we call program) is a finite set of equational clauses
of the form $l = r \Leftarrow e_1, \ldots, e_n$ with $n \geq 0$ where $e_i$ is an
equation, $1 \leq i \leq n$. The theory (and the clauses) are called conditional
if n > 0 and unconditional if n = 0. An equational theory can also be viewed as a
(Conditional) Term Rewriting System (CTRS) since the equation in the head is
implicitly oriented from left to right and the literals $e_i$ in the body are
ordinary non-oriented equations. Given a (C)TRS $\mathcal{R}$, $t \to_{\mathcal{R}} s$
is a rewrite step if there exists an occurrence u of t, a rule $l = r \in \mathcal{R}$
and a substitution $\theta$ with $t|_u = \theta(l)$ and $s = t[\theta(r)]_u$. A term t
is said to be in normal form wrt $\mathcal{R}$ if there is no term t' with
$t \to_{\mathcal{R}} t'$. We say that an equation t = s is normalized wrt
$\mathcal{R}$ if t and s are in normal form. $\mathcal{R}$ is said to be canonical
if the binary one-step rewriting relation $\to_{\mathcal{R}}$ is terminating (there
is no infinite chain $s_1 \to_{\mathcal{R}} s_2 \to_{\mathcal{R}} s_3 \to_{\mathcal{R}} \cdots$)
and confluent ($\forall s_1, s_2, s_3 \in T(\Sigma, X)$ such that
$s_1 \to^*_{\mathcal{R}} s_2$ and $s_1 \to^*_{\mathcal{R}} s_3$, $\exists s \in T(\Sigma, X)$
such that $s_2 \to^*_{\mathcal{R}} s$ and $s_3 \to^*_{\mathcal{R}} s$). An
$\mathcal{E}$-unification algorithm defines a procedure for solving an equation
t = s within the theory $\mathcal{E}$. Narrowing is a sound and complete method
for solving equations wrt canonical programs. Given a program P, a term t narrows
into a term t' (in symbols $t \stackrel{u,\, l=r,\, \theta}{\leadsto_P} t'$) iff
$u \in \bar{O}(t)$, l = r is a new variant of a rule from P,
$\theta = mgu(t|_u, l)$ and $t' = \theta(t[r]_u)$. We write $t \leadsto^n_P t'$ if t
narrows into t' in n narrowing steps.
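
The definition of a narrowing step can be sketched in a few lines of Python. The term encoding (variables as strings, constants and compound terms as tuples headed by their function symbol), the helper names and the example rule are assumptions made for this illustration only; the caller is also assumed to pass a rule whose variables have already been renamed apart from those of the term.

```python
# Terms: a variable is a str; f(t1, ..., tn) is the tuple ("f", t1, ..., tn);
# a constant a is ("a",). Occurrences are tuples of 1-based argument indices,
# with () standing for the root occurrence.

def is_var(t):
    return isinstance(t, str)

def subst(theta, t):
    """Apply the substitution theta (a dict) to the term t."""
    if is_var(t):
        return subst(theta, theta[t]) if t in theta else t
    return (t[0],) + tuple(subst(theta, a) for a in t[1:])

def occurs(v, t, theta):
    t = subst(theta, t)
    if is_var(t):
        return t == v
    return any(occurs(v, a, theta) for a in t[1:])

def unify(a, b):
    """Return a most general unifier of a and b, or None if there is none."""
    theta, stack = {}, [(a, b)]
    while stack:
        x, y = stack.pop()
        x, y = subst(theta, x), subst(theta, y)
        if x == y:
            continue
        if is_var(x):
            if occurs(x, y, theta):
                return None
            theta[x] = y
        elif is_var(y):
            stack.append((y, x))
        elif x[0] == y[0] and len(x) == len(y):
            stack.extend(zip(x[1:], y[1:]))
        else:
            return None
    return theta

def subterm(t, pos):
    for i in pos:
        t = t[i]
    return t

def replace(t, pos, new):
    if not pos:
        return new
    i = pos[0]
    return t[:i] + (replace(t[i], pos[1:], new),) + t[i + 1:]

def narrow(t, pos, rule):
    """One narrowing step at a non-variable occurrence pos with rule l = r."""
    l, r = rule
    sub = subterm(t, pos)
    if is_var(sub):
        return None
    theta = unify(sub, l)
    if theta is None:
        return None
    return subst(theta, replace(t, pos, r)), theta

# a + W narrows with the rule X + 0 = X by binding W to 0.
term = ("+", ("a",), "W")
rule = (("+", "X", ("0",)), "X")
result = narrow(term, (), rule)
```

Unlike plain rewriting, the step instantiates the variable W of the term being narrowed; this is the specialisation that the inductive operators of the following sections invert.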

3 The Inverse Narrowing Approach 

In this section, we briefly outline the inverse narrowing approach presented
in [3]. The algorithm is composed of two operators: Consistent Restricted
Generalisation and Inverse Narrowing.

Since we had to ensure posterior satisfiability, the inverse narrowing method
began by generating all possible restricted generalisations of each positive
example that were consistent with both the positive and the negative examples.
These were computed by the Consistent Restricted Generalisation operator.

Definition 1. Consistent Restricted Generalisation (CRG)

An equation e = {l = r} is a consistent restricted generalisation (CRG) wrt
$E^+$ and $E^-$ and an existing theory $T = B \cup P$ if and only if e is a restricted
generalisation of some equation of $E^+$ (always oriented left to right) and there
does not exist: (1) a narrowing chain using e and T that yields some equation
of $E^-$, and (2) a narrowing chain using e and T that yields a different normal
form for some lhs different from the rhs which appeared in the equations of $E^+$.

^ Or simply $t \stackrel{l=r,\, \theta}{\leadsto_P} t'$ or $t \stackrel{\theta}{\leadsto} t'$ if the occurrence or the rule is clear from the context.
Also, the subscript P will usually be dropped when clear from the context.

^ An equation t = s is a restricted generalisation of an equation r = m if
$\exists \sigma : \sigma(t) = r \wedge \sigma(s) = m$, and $\forall x \in Var(s) : x \in Var(t)$.

Secondly, the inverse narrowing operator was defined as an operator that
generates an equation from two equations.

Definition 2. Inverse Narrowing

Given a functional logic program P, we say that a term t conversely narrows
into a term t', and we write $t \stackrel{u,\, l=r,\, \theta}{\hookleftarrow_P} t'$, iff
$u \in \bar{O}(t)$, l = r is a new variant of a rule from P,
$\theta = mgu(t|_u, r)$ and $t' = \theta(t[l]_u)$. The relation $\hookleftarrow_P$ is called the
inverse narrowing relation.

Now, we will concentrate our attention on how the inverse narrowing approach
induces equations. Suppose that s = t and l = r are the equations
selected by the algorithm, such that $t|_u$ unifies with r with $\theta = mgu(t|_u, r)$.
Then, $s = \theta(t[l]_u)$ and l = r are the two equations induced in an inverse
narrowing approach step. It is easy to see the relationship between this algorithm
and the transitive property of equality. In what follows, for the sake of legibility,
we consider x, y and z to be subterms at the occurrence $\epsilon$. The next rationale is
still valid for any other occurrence in a term.

The transitive property is expressed as:

$x \to y \wedge y \to z \Rightarrow x \to z$    (1)

whereas an inverse narrowing approach step can also be represented as:

$x \to z \wedge y \to z \Rightarrow x \to y \wedge y \to z$    (2)

where $x \to y$ in (2) is the equation computed by inverse narrowing from $x \to z$
and $y \to z$. However, (2) is not a real inversion of transitivity because it begins
from two equations of formula (1) (one of the premises and the result) and
generates the other premise. To have a constructive inversion of the transitivity
of equality, the behaviour of the algorithm should be as follows:

$x \to z \Rightarrow x \to y \wedge y \to z$    (3)

where $x \to y$ and $y \to z$ are the result of this constructive inverse narrowing.
Notice that the term y in formula (3) is new. The following schema
extends the setting to cope not only with this inverse transitivity, but also
with inverse replacement. This is the mechanism which will allow us to introduce
auxiliary functions in the inductive process.



4 IFLP Schema 

Let us denote the set of function symbols of arity $\geq 0$ which appear in a program
P as $\Sigma_P$. In the same way, $\Sigma_{E^+}$, or simply $\Sigma^+$, denotes the set of function
symbols of arity $\geq 0$ which appear in the positive evidence $E^+$.

As we have stated, narrowing is based on a mgu, which is a specialisation,
followed by a replacement. It is logical then to base the induction of functional
logic programs on an inversion of these deductive operators. Consequently, we
introduce two operators: an inverse specialisation, namely a generalisation, and
an inverse replacement.

Definition 3. Unrestricted Generalisation (UG)

An equation e' = {l' = r'} is an unrestricted generalisation (UG) of an equation
e = {l = r} if and only if there exists a substitution $\theta$ such that $\theta(l') = l$ and
$\theta(r') = r$.
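
Definition 3 amounts to a one-way matching test, which the following Python sketch makes explicit. The tuple encoding of terms (variables as strings, constants as 1-tuples) and the function names `match` and `is_ug` are assumptions of this illustration, not part of the schema.

```python
def is_var(t):
    return isinstance(t, str)

def match(pattern, target, theta):
    """Extend theta so that theta(pattern) == target; return None on failure.
    Variables of the target are treated as constants (one-way matching)."""
    if is_var(pattern):
        if pattern in theta:
            return theta if theta[pattern] == target else None
        return {**theta, pattern: target}
    if is_var(target) or pattern[0] != target[0] or len(pattern) != len(target):
        return None
    for p, t in zip(pattern[1:], target[1:]):
        theta = match(p, t, theta)
        if theta is None:
            return None
    return theta

def is_ug(general, specific):
    """True if `general` (l' = r') is an unrestricted generalisation of
    `specific` (l = r): a single substitution maps l' to l and r' to r."""
    (l_gen, r_gen), (l, r) = general, specific
    theta = match(l_gen, l, {})
    return theta is not None and match(r_gen, r, theta) is not None

a, b = ("a",), ("b",)
specific = (("f", a), ("r", ("h", b, a)))            # f(a) = r(h(b, a))
general = (("f", "X"), ("r", ("h", b, "X")))         # f(X) = r(h(b, X))
fresh_rhs = (("f", a), "Y")                          # f(a) = Y
```

Here `is_ug(general, specific)` holds, and so does `is_ug(fresh_rhs, specific)`: an unrestricted generalisation may introduce a variable on the rhs that does not occur on the lhs, which a restricted generalisation forbids.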

Definition 4. Single Intra-Replacement (SIR)

Given an equation s = t, choose any occurrence $\omega$ of t and any function symbol
$F \in \Sigma^+$ to construct a new term q in the following way:

$q = t[\varphi]_\omega$

where $\varphi = F(X_{k,1}, X_{k,2}, \ldots, X_{k,n})$, $n \geq 0$ is the arity of F and the $X_{k,i}$ are different
fresh variables. The subscript k is used to distinguish these variables from other
variables in previous or subsequent uses of this operator.

As output, the SIR operator produces a first equation $D_k$ as: s = q, and a
second equation $E_k$ as: $q|_\omega = t|_\omega$.

The first result from this definition is that $D_k$ and $E_k$ make true that $s \leadsto^2 t$,
i.e. s can be narrowed into t in two narrowing steps, because
$s \stackrel{\epsilon,\, s=q,\, \emptyset}{\leadsto} q$ and
$q \stackrel{\omega,\, q|_\omega = t|_\omega,\, \emptyset}{\leadsto} t$. Following the definition, and taking into account both $D_k$ and $E_k$,
the operator SIR can only generalise. However, if the subterm at occurrence $\omega$ is a variable
X, the second equation is of the form $\varphi = X$. If we remove this equation, it can
be said that SIR specialises. Despite this seemingly contradictory behaviour, the
operator must be used iteratively in order to specialise a variable into a term
which has more than one function symbol.

Example 2. Suppose an original equation f(g(a)) = b and $\Sigma^+$ = {f, g, h, a, b} with
their corresponding arities. By choosing the occurrence $\omega = \epsilon$ and F = h, we generate
the following two equations:

a first equation $D_k$ as: $f(g(a)) = h(X_{k,1}, X_{k,2})$

and a second equation $E_k$ as: $h(X_{k,1}, X_{k,2}) = b$

We can apply the same operator to $D_k$ at occurrence $\omega' = 2$ and F = a. This gives
a third equation $D_{k+1}$ as: $f(g(a)) = h(X_{k,1}, a)$

and a fourth equation $E_{k+1}$ as: $a = X_{k,2}$

It is easy to show that the original equation is covered by the program which can be
constructed from $D_k$, $E_k$, $D_{k+1}$, $E_{k+1}$. However, it would be interesting to be able to
specialise the lhs of the $E_k$'s and to allow more than one new symbol on the lhs.

Both things can be obtained by using the following simple operator:

Definition 5. Syntactic Folding (SF)

Given two equations $E_1 = \{l_1 = r_1\}$ and $E_2 = \{l_2 = r_2\}$ with $r_1$ being a variable
such that there exists an occurrence $\omega$ such that $r_1 = (l_2)|_\omega$, a new folded equation
can be constructed as $l_2[l_1]_\omega = r_2$. The same applies if such an occurrence is in
$r_2$.

In the previous example $E_{k+1}$ and $E_k$ could be folded into $h(X_{k,1}, a) = b$ by
using the occurrence $\omega = 2$.
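
The SIR and SF operators are simple enough to replay Example 2 mechanically. The following Python sketch does so under an assumed encoding: variables are strings, constants and compound terms are tuples headed by their function symbol, and occurrences are tuples of 1-based argument indices. The function names `sir` and `fold` and the way fresh variables are named are illustrative choices, and `fold` only covers the case where the occurrence lies in the lhs of the second equation.

```python
def subterm(t, pos):
    for i in pos:
        t = t[i]
    return t

def replace(t, pos, new):
    if not pos:
        return new
    i = pos[0]
    return t[:i] + (replace(t[i], pos[1:], new),) + t[i + 1:]

def sir(eq, pos, symbol, arity, k):
    """Single Intra-Replacement on s = t at occurrence pos of t.
    Returns D_k (s = q) and E_k (q|pos = t|pos) with q = t[phi]_pos."""
    s, t = eq
    phi = (symbol,) + tuple(f"X{k},{i}" for i in range(1, arity + 1))
    q = replace(t, pos, phi)
    return (s, q), (phi, subterm(t, pos))

def fold(e1, e2, pos):
    """Syntactic Folding: if the rhs of e1 is the variable found at
    occurrence pos of the lhs of e2, plug the lhs of e1 in there."""
    (l1, r1), (l2, r2) = e1, e2
    if not isinstance(r1, str) or subterm(l2, pos) != r1:
        return None
    return (replace(l2, pos, l1), r2)

a, b = ("a",), ("b",)
original = (("f", ("g", a)), b)                 # f(g(a)) = b
d_k, e_k = sir(original, (), "h", 2, "k")       # occurrence epsilon, F = h
d_k1, e_k1 = sir(d_k, (2,), "a", 0, "k")        # occurrence 2, F = a
folded = fold(e_k1, e_k, (2,))                  # h(Xk,1, a) = b
```

Each SIR application introduces exactly one function symbol, so building a deeper term requires iterating the operator and folding the resulting secondary equations, as Example 3 and Lemma 1 do.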

Example 3. Consider Example 1 again. If the first equation from the evidence is
selected, i.e. f(a) = r(g(b,b)), and the SIR operator is applied at occurrence $\omega = 1$ and
with function symbol h, the following two equations are generated:
a first equation $D_k$ as: $f(a) = r(h(X_{k,1}, X_{k,2}))$

and a second equation $E_k$ as: $h(X_{k,1}, X_{k,2}) = g(b,b)$

We can apply the same operator to $D_k$ at occurrence $\omega' = 1.1$ and F = b. This
produces:

a third equation $D_{k+1}$ as: $f(a) = r(h(b, X_{k,2}))$

and a fourth equation $E_{k+1}$ as: $b = X_{k,1}$

If SIR is applied again to $D_{k+1}$ but now at occurrence $\omega'' = 1.2$ and F = a, this gives:
a fifth equation $D_{k+2}$ as: $f(a) = r(h(b, a))$

and a sixth equation $E_{k+2}$ as: $a = X_{k,2}$

Equation $D_{k+2}$ can be generalised into $f(X_{k,2}) = r(h(b, X_{k,2}))$ which is one rule of
program P. By using the SF operator, $E_k$ and $E_{k+2}$ can be folded into $h(X_{k,1}, a) = g(b, b)$
and then folded again by using $E_{k+1}$ into h(b, a) = g(b, b) which can then be generalised
into $h(X_{k,1}, a) = g(X_{k,1}, b)$, which is another rule of the program.

These three operators are able to construct virtually any term as the following
lemma and theorem show:

Lemma 1. Select any term r constructable from $T(\Sigma)$. Given any equation
s = t and any occurrence $\omega$ of t there exists a finite combination of the SIR and
SF operators that generates these two equations:
a first equation D as: s = q

and a second equation E as: $q|_\omega = t|_\omega$

where $q = t[r]_\omega$.

Proof. Let us prove this lemma by mathematical induction. Consider d equal to the
depth of the tree which can be drawn from r, e.g. f(g(a, h(a, a))) has depth 4.

For d = 1, the lemma is obvious because it is only necessary to apply the SIR
operator at occurrence $\omega$ with the term $\varphi = r$.

Let us suppose the hypothesis that the lemma is true for k. Then, we have to show that
it is true for k + 1. Consider that $r|_u = g(a_1, a_2, \ldots, a_n)$ where the $a_i$ are function symbols
of arity 0, i.e. constants, and $u = x_1.x_2.\cdots.x_k$ and there is no other occurrence at level
k + 1 but the $a_i$. By hypothesis we have been able to construct two equations for depth
k:

a first equation $D_k$ as: s = q

and a second equation $E_k$ as: $q|_\omega = t|_\omega$

where $q = t[r']_\omega$, with r' being $r[a]_{x_1.x_2.\cdots.x_k}$ where this a does not appear again in r'.
Since this a appears once, it is obvious that this step could have been avoided and we
could have a variable X instead of a term a as well.

Let us apply the SIR operator to the first equation at occurrence $\omega' = x_1.x_2.\cdots.x_k$
with $\varphi = g(X_{k,1}, X_{k,2}, \ldots, X_{k,n})$, where n > 0 is the arity of g and the $X_{k,i}$ are different fresh
variables. This generates two equations:
a first equation $D_{k+1}$ as: s = q'

and a second equation $E_{k+1}$ as: $q'|_{\omega'} = q|_{\omega'}$

where $q' = q[\varphi]_{\omega'}$. We can apply the SIR operator n times with function symbols $a_i$ to
$D_{k+1}$ at all its n positions giving respectively:
a first equation $D_i$ as: $s = q'[a_i]_i$

and a second equation $E_i$ as: $a_i = X_{k,i}$

These $E_i$ can be used jointly with $D_{k+1}$ by operator SF to construct a new equation A,
$s = q[g(a_1, a_2, \ldots, a_n)]_{\omega'}$, which is equal to $s = t[r]_\omega$, and they can be used jointly with
$E_{k+1}$ by operator SF for a second equation, $q[g(a_1, a_2, \ldots, a_n)]_{\omega'}|_{\omega'} = q|_{\omega'}$. Finally, since
the rhs of this last equation is X, we can apply a SF operator to this last equation and
$E_k$ giving an equation B, $r = t|_\omega$. Both A and B are precisely the equations D and E
of the lemma.

Since this holds for k + 1 if it holds for k, we can affirm that it holds for all k. □

Theorem 1. Select any term r' which is constructable from $T(\Sigma, X)$. Given
any equation s = t and any occurrence $\omega$ of t, there exists a finite combination
of the SIR, the SF and the UG operators that generates these two equations:
a first equation D' as: s = q', and a second equation E' as: $q'|_\omega = t|_\omega$

where $q' = t[r']_\omega$.

Proof. Given the equation s = t and any term r', consider a new term r such that
any variable in r' is substituted by a function symbol of arity 0. Obviously, this r is
ground, and, by Lemma 1, it can be constructed by a finite combination of the operators
SIR and SF, resulting in a first equation $D_k$ as s = q, and a second equation $E_k$ as
$q|_\omega = t|_\omega$ where $q = t[r]_\omega$.

Take $D_k$ and use a UG to obtain a new equation $s = t[r']_\omega$ which is equal to s = q'. In
the same way all the $E_k$ can be generalised to obtain a new equation $r' = t|_\omega$. □



5 Strong Completeness of the IFLP Schema 

Theorem 1 is essential to be able to show that any possible intermediate term 
that may be used in a derivation can be induced by using the operators of the 
IFLP Schema. This leads to the following strong completeness result: 

Theorem 2. Strong Completeness

Given a finite program P, and a finite evidence E generated from P, such that
every rule of P is necessary for at least one equation of the positive evidence (i.e.
if removed some positive example is not covered), and $\Sigma_P = \Sigma^+$, i.e., all function
symbols of the program appear in the positive evidence, then the program can be
induced by a finite combination of the operators presented in the IFLP schema,
that is to say, Unrestricted Generalisation (UG), Single Intra-Replacement (SIR)
and Syntactic Folding (SF).

Proof. Select any rule r = {s = t} from P. Since it is necessary, it is used in at
least one derivation of one example, say $a_0 = a_n$. We express this derivation as
a chain of n narrowing steps from $a_0$ to $a_n$.

If n = 1, i.e. the derivation is $a_0 \stackrel{u_1,\, l_1=r_1,\, \theta_1}{\leadsto} a_1$, then we have that under the IFLP schema
we can generate a first equation $D_k$ as: $a_0 = a_0$, and a second equation $E_k$ as:
$(a_0)|_\omega = (a_1)|_\omega$, such that $\omega = u_1$, and what has to be introduced is $(a_0)|_\omega$. This
can be done as was shown in Theorem 1. The last equation $E_k$ can be generalised in
order to match $l_1 = r_1$.

Let us assume the hypothesis that we have been able to generate all the $l_i = r_i$
up to n − 1. Then, for n we have:

$a_0 \stackrel{u_1,\, l_1=r_1,\, \theta_1}{\leadsto} a_1 \stackrel{u_2,\, l_2=r_2,\, \theta_2}{\leadsto} a_2 \leadsto \cdots \leadsto a_n$

Since it has been generated up to $a_{n-1}$ we only have to show that it is possible to
generate the equation that allows for narrowing from $a_{n-1}$ to $a_n$, i.e.
$a_{n-1} \stackrel{u_n,\, l_n=r_n,\, \theta_n}{\leadsto} a_n$. However, this step is no different from the step we proved for n = 1, so we can find
this $l_n = r_n$ and the hypothesis is true for all n. Thus, the theorem is proven. □



Strong completeness is unusual in the inductive literature (an exception is [12]),
because, without additional information (e.g. modes), it entails intractability.
However, the previous theorem identifies a set of operators which is sufficient
to induce any possible program. Further work is centred on finding restrictions
which preserve completeness or bring the schema to tractability. For the
latter, there are at least two possible routes. The first is to fix a selection
criterion (e.g. compression), ensuring completeness wrt this criterion by using
an ordered search space and mode declarations (e.g. [11]). The second is to
study incomplete but still powerful instances of the schema and provide efficient
algorithms for them. The authors are currently pursuing the first option through
the use of genetic programming as in [17]. The second option is precisely the one
taken in [3], and the next section discusses its relation to the preceding schema.



6 Inverse Narrowing as an Instance of the IFLP Schema 

In this section, we show that Inverse Narrowing is just an instance of our generic 
IFLP Schema. This relationship allows a more detailed study of our previous 
algorithm, its limitations and its extensions to cover more difficult cases without 
falling into intractability. 

First of all, it is evident that, according to the definition given in Section 3,
CRG is just a restriction of UG. Secondly, inverse narrowing was defined
as an operator that generates an equation from two equations. In contrast,
SIR generates two equations from one equation. This operator, iterated and
combined with the other operators of the IFLP Schema, is a generalisation of inverse
narrowing, as the following corollary shows:



Corollary 1. For any equation s = t such that we can make an inverse narrowing
step $t \stackrel{u,\, l=r,\, \theta}{\hookleftarrow_P} t'$ to obtain a pair of equations s = t' and l = r, these
two equations can be obtained in the IFLP schema.

Proof. Just apply the operators of the IFLP schema to obtain a first equation $D_k$ as:
s = q, and a second equation $E_k$ as: $q|_\omega = t|_\omega$ such that $\omega = u$, where $q = t[l]_\omega$ and
$t|_\omega = r$.

The only difference is that $t' = \theta(q)$, i.e., a substitution is applied, but this difference
vanishes if we select $q = \theta(t[l]_\omega)$ and then we generalise the second equation $E_k$ to
make it match l = r. □



7 Extending Inverse Narrowing 

As we have stated before, the IFLP Schema should suggest different ways to
generalise inverse narrowing to cope with more complex cases. In this work, it
has been shown that if a function symbol did not appear under some convenient
conditions, it could not be induced by inverse narrowing. We can therefore extend
inverse narrowing to allow fresh variables on the rhs's, with the secondary
equation obtained either from a set of generalised equations from the evidence
or by introducing a new function symbol F in
an equation of the form $F(X'_1, X'_2, \ldots, X'_n) = Y$, which obviously can be
used at any occurrence of the other equation since Y unifies with anything.



7.1 Function Invention 

The invention of predicates is an open area of research in ILP [13, 16, 10, 7]. In
the case of unconditional functional logic programs it is expected that function
invention would be even more necessary than in first-order Horn logic [16].

In our strong completeness theorem, we assumed that $\Sigma_P = \Sigma^+$, i.e., all
function symbols of the program appear in the positive evidence.

One of the reasons for the introduction of this general schema is that in the
case where the relation $\Sigma_P \supset \Sigma^+$ is strict, we can extend $\Sigma^+$ with new and fresh
function symbols of different arities, thus making the invention of new functions
possible. The set of inventable functions is denoted by $\Sigma^*$, and the SIR operator
can then construct terms by using function symbols from $\Sigma^+ \cup \Sigma^*$.

Under the extension of the signature, it is clear that the IFLP Schema is
able to invent functions. The procedure resembles the approach presented in
[10], where maximal utilisation predicates are introduced and then refined. In
our case, they are refined by the possible introduction of function symbols in
different occurrences at different stages.

In contrast, in order to extend our previous inverse narrowing approach
[3] with function invention, we are forced to act in the reverse way due to the
nature of this procedure. Our extended inverse narrowing is able to do inventions
of this kind by adding equations of the form F(a, a, ..., a) = Y where F is a
new function symbol from $\Sigma^*$ and a is a constant which appears in $\Sigma^+$. These
equations can be used as secondary equations in an inverse narrowing step. The
use of inverse narrowing on the lhs is also required (this was already done when
learning from background knowledge in [3]). Therefore, the approach becomes
too general for practical purposes, as the following example illustrates.
Example 4. Let us consider the example of inducing the product function from scratch,
which requires the invention of a function for addition. To do this, make $\Sigma^*$ = {+}
where + has arity 2. Given the following evidence:

($E_1^+$) ss0 × ss0 = ssss0
($E_2^+$) sss0 × ss0 = ssssss0
($E_3^+$) sss0 × s0 = sss0
($E_4^+$) 0 × sss0 = 0
($E_5^+$) ss0 × 0 = 0
($E_6^+$) ssss0 × 0 = 0
($E_7^+$) s0 × ss0 = ss0

($E_1^-$) ss0 × sss0 = sssssssss0
($E_2^-$) sss0 × sss0 = ssssssss0
($E_3^-$) ss0 × sss0 = ssss0
($E_4^-$) sss0 × ss0 = ssss0
($E_5^-$) 0 × s0 = s0
($E_6^-$) s0 × 0 = ss0

We can proceed as follows. The equation {0 × X = 0} is just a generalisation of $E_4^+$.
From $\Sigma^*$ = {+}, we introduce the equation +(0, 0) = Y, which can be expressed in
infix notation as $E_1$ = {0 + 0 = Y}. From $\Sigma^+$ we introduce the equations $E_2$ = {s(0) = Y'}
and $E_3$ = {0 = Y''}. We make inverse narrowing at occurrence $\epsilon$ of the rhs of $E_1$ with
$E_3$ and we have $E_4$ = {0 + 0 = 0} that can be generalised into X + 0 = X. By repeatedly
using inverse narrowing on different occurrences we can obtain the following equation
X + s(Y) = s(X + Y). Although, in this case, this involves only three steps, in general
it would be necessary to use heuristics or mode declarations. Even with all this, some
systems (e.g. [11]) are helped by some examples of the addition in the evidence. The
equation sX × Y = X × Y + Y can be obtained as was shown in [3] since the equations of
addition are already generated.

At the end of the process, the following program can be constructed: P = {0 × X =
0, sX × Y = X × Y + Y, X + 0 = X, X + s(Y) = s(X + Y)}.
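
As a sanity check on the induced program, the following Python sketch normalises Peano terms with the four rules of P and tests the evidence of Example 4 against it. The tuple encoding of terms, the use of `*` for the product symbol, the integer shorthand for the evidence and the innermost evaluation strategy are assumptions of this sketch; the paper itself works with narrowing rather than with this ad hoc normaliser.

```python
ZERO = ("0",)

def num(n):
    """Peano numeral s^n(0) as a nested tuple."""
    t = ZERO
    for _ in range(n):
        t = ("s", t)
    return t

def normalise(t):
    """Innermost normalisation with the four rules of the induced program."""
    if t[0] == "0":
        return t
    if t[0] == "s":
        return ("s", normalise(t[1]))
    x, y = normalise(t[1]), normalise(t[2])
    if t[0] == "+":
        if y == ZERO:                                   # X + 0 = X
            return x
        return ("s", normalise(("+", x, y[1])))         # X + s(Y) = s(X + Y)
    if t[0] == "*":
        if x == ZERO:                                   # 0 * X = 0
            return ZERO
        return normalise(("+", ("*", x[1], y), y))      # sX * Y = X * Y + Y
    raise ValueError(f"unknown symbol {t[0]!r}")

def covers(x, y, z):
    """Does the program derive s^x(0) * s^y(0) = s^z(0)?"""
    return normalise(("*", num(x), num(y))) == num(z)

# Evidence of Example 4, written as (x, y, z) for s^x(0) * s^y(0) = s^z(0).
positive = [(2, 2, 4), (3, 2, 6), (3, 1, 3), (0, 3, 0), (2, 0, 0), (4, 0, 0), (1, 2, 2)]
negative = [(2, 3, 9), (3, 3, 8), (2, 3, 4), (3, 2, 4), (0, 1, 1), (1, 0, 2)]

all_positive_covered = all(covers(*e) for e in positive)
no_negative_covered = not any(covers(*e) for e in negative)
```

Both flags evaluate to true, i.e. the program covers every positive example and none of the negative ones.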



8 Conclusions and Future Work 

The IFLP Schema has been shown to be a general and strongly complete framework for
the induction of functional logic programs. Theoretically, this allows the
induction of functional logic programs with auxiliary functions and, if the signature is
conveniently extended, it can be used to invent functions. Moreover, although
intricate combinations of the operators which have been presented may be needed
in order to obtain the rules of the intended program, function symbols are
introduced one by one. This makes extending our previous algorithm with the new
operators possible, since it is based on genetic programming techniques.

This theoretical work is a necessary stage in a longer-term project to
explore the advantages of extending the representational language of ILP to
functional logic programs. This is also subject to a convenient extension of our
schema to conditional theories, which will make it possible to use and compare
the same examples and background knowledge as in ILP problems. Less
theoretical ongoing work is centred on the development and implementation of a
more powerful but still tractable algorithm than the one presented in [3].
Abstract. From the point of view of computational linguistics, Hungarian
is a difficult language due to its complex grammar and rich morphology.
ogy. This means that even a common task such as part-of-speech tagging 
presents a new challenge for learning when looked at for the Hungarian 
language, especially given the fact that this language has fairly free word 
order. In this paper we therefore present a case study designed to illus- 
trate the potential and limits of current ILP and non-ILP algorithms 
on the Hungarian POS-tagging task. We have selected the popular C4.5 
and Progol systems as propositional and ILP representatives, adding ex- 
periments with our own methods AGLEARN, a C4.5 preprocessor based 
on attribute grammars, and the ILP approaches PHM and RIBL. The 
systems were compared on the Hungarian version of the multilingual 
morphosyntactically annotated MULTEXT-East TELRI corpus which 
consists of about 100,000 tokens. Experimental results indicate that Hungarian
POS-tagging is indeed a challenging task for learning algorithms,
that even simple background knowledge leads to large differences in ac- 
curacy, and that instance-based methods are promising approaches to 
POS tagging also for Hungarian. The paper also includes experiments 
with some different cascade connections of the taggers. 



1 Introduction 

Part-of-speech (POS) tagging is one of the first stages in natural language related 
processing (e.g., parsing, information extraction). The task of a POS tagger is, for 
a given text, to provide for each word in the text its contextually disambiguated 
part of speech tag representing the word’s morphosyntactic category. Depending 
on source language grammar and application needs, a suitable tag set is chosen 
by linguists as basis for further analysis. Tagging is a difficult task in general 
because of the large number of ambiguities in natural language texts. For
Finno-Ugric languages, and Hungarian in particular, this task is even more complex
because they have rich morphology and fairly free word order. While a reduced
tag set containing 125 different tags overcomes the difficulty of the huge set
of morphosyntactic categories in practice, the free word order makes the POS
tagging task for the Hungarian language a new and challenging problem for learning
algorithms.

In this paper, we therefore present a case study designed to illustrate the 
potential and limits of current ILP and non-ILP algorithms on the Hungarian 
POS-tagging task. We have selected the popular C4.5 and Progol systems as 
propositional and ILP representatives, adding experiments with our own meth- 
ods AGLEARN, a C4.5 preprocessor based on attribute grammars, and the ILP 
approaches PHM and RIBL. For the experiments, we have used the annotated 
Hungarian TELRI corpus from the MULTEXT-East project which is currently 
the only available annotated corpus for Hungarian. The corpus contains about 
100,000 tokens (which is relatively small compared to e.g. the Wall Street Journal
corpus for English with its 3 million words). In the past, the MULTEXT-East 
corpus was already used to test Brill’s tagger generator [4, 17] but there, only 
relatively poor per-word overall accuracy was reached, adding additional moti- 
vation to the experiments with a reduced tag set as described in the present 
paper. 

The paper is organized as follows. In Section 2 we give a brief introduction to 
the POS tagging problem, describe previous work on generating taggers through 
learning, and then focus on issues related to the Hungarian language and the 
dataset that was used. In Section 3 we give our motivation for the design of 
our empirical case study, in particular the choice of learning algorithms. We also 
describe, for each chosen algorithm, how the POS tagging problem is mapped 
to the algorithm’s internal representation. Section 4 contains the results of our 
experiments along with a discussion. The paper ends with a discussion and 
conclusions from our studies along with suggestions for further research. 



2 POS Tagging and the Data Set 



During language processing, when a sentence is read, each word is labelled by
the set of its possible morphosyntactic descriptions, called tags. For example,
the Hungarian word 'hét' can be either a number ('seven') or a noun ('week').
In most cases a word's true tag is uniquely determined by its context in
the sentence. Since each language has ambiguous words (i.e. that have several 
different taggings), a working tagger must contain a disambiguation module 
choosing the true tag from the assigned tag set for each word. Ideally, the task 
of such a module is defined as follows. Given 

— a set S of correctly disambiguated tagged sentences,

— another correctly tagged set $S_{query}$ of sentences containing ambiguous words,

find the prediction of the true tag for each ambiguous word from $S_{query}$.

2.1 Previous Work on Generating Taggers through Learning

Several methods have been developed for the automatic generation of POS taggers
from annotated corpora. In [5] Cussens described a tagging system for English
that was generated by Progol (combined with a selection of the most
frequent class in those cases where Progol's rules did not make a decision). Progol
was trained on a corpus containing 3 million words and it generated “removable”
rules for ambiguous tags: whenever such a rule applies to a word, the rule’s tag
is removed from the set of candidates. Using simple linguistic background
knowledge, the generated tagger achieved 96.4% per word accuracy on the test
data^.

Daelemans & al. in [7] presented the MBT memory-based tagger generator
system for different languages. They trained their system on a subset of the
Wall Street Journal corpus (2 million words) for English and achieved 96.4%
accuracy. For other languages they reached the following results: Dutch (95.7%),
Spanish (97.8%) and Czech (93.6%).

Eineborg & Lindberg presented a tagging system for the Swedish language [9].
They applied Progol and, in a similar way to Cussens’ approach, they learned
removable rules for ambiguous tags. The Swedish UMEA Corpus (1 million words)
was used, and a 97% accuracy was achieved.

The learned taggers can be combined in order to get more accurate results. 
Halteren & al. in [13] discussed several possible combinations of four well-known 
tagger generator methods (HMM, Memory Based, Brill, Maximum Entropy). 
They showed that all combination taggers outperform their best component. 
Their best combined tagger named pairwise-voting obtained 97.92% accuracy 
for the LOB corpus (1 million words). 

An application of Brill’s tagger generator to Hungarian is presented
in [17]. The TELRI [11] corpus (100,000 words) was used but only 87.5% per
word accuracy was reached. This empirical result showed that Brill’s method
has some difficulty with languages whose characteristics differ
from those of English.

2.2 The Data Set 

Our experiments were based on an improved version of the TELRI corpus [11]
in which several incorrect annotations had been corrected. This corpus was
produced within the framework of the "MULTEXT-East" project whose main goal was
to prove the suitability of the tag coding convention MorphoSyntactic
Description (MSD) to East-European languages. Orwell’s famous novel "1984" was
translated into six Eastern-European languages including Hungarian and
annotated with MSD codes.

Words in the Hungarian language may have several thousand MSD codes. In
order to reduce this huge set, linguists also propose the CTAG code convention
which leads to a smaller set containing only 125 different tags [19]. Since each
CTAG code represents a set of MSD codes, the CTAG annotated version of a text
contains at most as many ambiguities as its MSD annotation. Clearly, by using
CTAGs we lose some linguistic information; however, as reported in [19],
the CTAG annotation is still useful in a number of cases of processing
Hungarian texts.

^ Per word accuracy is computed on all words of the test text, i.e., including both
unambiguous words (already tagged correctly by the lexicon) and ambiguous words
(requiring the learned tagger module).

In our experiments we used the CTAG annotated Hungarian version of the 
TELRI corpus. From the corpus, we first created a tag lexicon by collecting for 
each word the list of tags that actually occurred in the corpus. This lexicon was 
then used to convert the sentences of the corpus into training data. For example 
the first sentence of the novel 

"Derült, hideg áprilisi nap volt, az órák éppen tizenhármat ütöttek." ^

was converted into the list 

(asn, {asn, vmisSs}), wpunct, asn, asn, nsn, vmisSs, 
wpunct, (t, {psn, t}), npn, rg, ms, vmisSp, spunct 

If the tagging of a particular word is ambiguous (according to the lexicon) then
the corresponding item is a pair, the first element of which is the true tag
(according to the annotation) and the second is the set of possible tags (according
to the lexicon). Chapters 1 and 2 of the novel were used as training data
while chapters 3 and 4 served as test data. Table 1 presents the number of tokens in
the corpus.
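
The lexicon construction and the conversion of sentences into training items can be sketched as follows. This is a minimal illustration of the procedure described above, not the authors' code: the function names, the two-sentence toy corpus and the tags NUM and NOUN are hypothetical stand-ins for the real corpus and the CTAG set.

```python
from collections import defaultdict

def build_lexicon(tagged_sentences):
    """Collect, for each word, the set of tags it actually carries in the corpus."""
    lexicon = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            lexicon[word].add(tag)
    return dict(lexicon)

def to_items(sentence, lexicon):
    """Unambiguous words become their tag; ambiguous words become the pair
    (true tag, set of possible tags according to the lexicon)."""
    items = []
    for word, tag in sentence:
        candidates = lexicon[word]
        if len(candidates) > 1:
            items.append((tag, frozenset(candidates)))
        else:
            items.append(tag)
    return items

# Hypothetical toy corpus: 'hét' occurs once as a numeral and once as a noun.
corpus = [
    [("hét", "NUM"), ("nap", "NOUN")],
    [("egy", "NUM"), ("hét", "NOUN")],
]
lexicon = build_lexicon(corpus)
items = to_items(corpus[0], lexicon)
```

Because the lexicon is built from the corpus itself, a word counts as ambiguous only if more than one of its tags is actually attested there.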



Category       Number of tokens
               Training    Test    Altogether
word              59035   21673         80708
punctuation       12484    5235         17719
total             71519   26908         98427

Table 1. The number of tokens in the Hungarian corpus



3 Empirical Evaluation Design 

The goal of our case study was not to compare particular learning systems, but 
to examine the influence of the unusual and difficult properties of the Hungarian 
language (complex grammar, free word order) on the learning problem. We there- 
fore selected a set of learning systems representative of different system classes 
that have been tried before on POS tagging tasks to see how their performance 
would be affected and whether phenomena observed in other experiments would 
carry over to Hungarian. 

^ It was a bright cold day in April, and the clocks were striking thirteen. 
To this end, as reference standards we selected two learning systems
representative of popular propositional and ILP learners used for tagging problems,
C4.5 [20] and the C version of Progol 4.4 [18]. Since it appears that for other lan- 
guages, memory-based methods are promising, we further included the instance- 
based ILP system RIBL [10, 3] in the comparison. To examine the strong in- 
fluence of background knowledge (reported e.g. in [5]), we used AGLEARN, 
a learning system that adds additional attributes to the input of C4.5. These 
attributes are specified with the help of an attribute grammar and actually com- 
puted by a parser that is invoked on the context of the focus word. Finally, as we 
wanted to examine the influence of depth bounds (window sizes) on the learning 
problem, we included experiments with PHM [16, 14], a method that is capa- 
ble of efficiently learning logic programs with colored path graph background 
knowledge without depth bound. 

Below we sketch the particular representations used for each of these learning 
systems. However, let us first describe some general properties of our experimental
setup. First, in contrast to experiments such as [5], where “removable” rules
were learned, we learned “choose” rules, i.e., whenever such a rule applies, the
rule’s tag is made the chosen (i.e., predicted) tag. Second, in order to increase
the specificity of the learned rules and also to reduce the high runtime complex- 
ity of the learning systems, we decided to split the learning problem by learning 
rules for each ambiguity class separately, and limited ourselves to those classes 
which occurred at least 50 times in the training set. These ambiguity classes 
represent 92.23% (6667 of 7229) of all ambiguity occurrences in the training 
set and 91.26% (2443 of 2677) of all ambiguous tokens in the test data set. In 
cases of C4.5, AGLEARN, and RIBL we considered a separate learning problem
for each ambiguity class. For Progol and PHM, a separate learning problem for
the predicate choose_t_i_from_t_1_..._t_u was created for each of the possible
true tags $t_i$ of each ambiguity class $c = \{t_1, \ldots, t_u\}$.

In all training examples, the unique correct tags for surrounding words were 
used. In cases of C4.5, AGLEARN, Progol, and PHM, the learned set of rules 
was used for testing as follows: whenever exactly one rule applied, this was used 
as the learner’s prediction; if none or more than one applied, the test example 
was taken as uncovered and was assigned, as in [5], to the most frequent candidate
category. Note that a rule cannot be applied if it does not cover the focus word 
or it refers to at least one ambiguous token in the context of the focus word. 
Since the IBL paradigm is able to handle missing attribute values as well, in the 
case of RIBL such ambiguous tokens were considered as objects without class 
information. 
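
The test-time procedure used for the rule learners can be sketched as below. This is a minimal sketch under stated assumptions: a rule is modelled as a hypothetical pair of a Boolean predicate and a tag, and a predicate is expected to return False whenever the rule does not cover the focus word or refers to an ambiguous context token.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical rule format: (applies, tag), where applies(example) is True
# only if the rule covers the example and refers to no ambiguous context token.
Rule = Tuple[Callable[[Dict[str, str]], bool], str]


def predict(example: Dict[str, str],
            rules: Sequence[Rule],
            most_frequent_tag: str) -> Tuple[str, bool]:
    """Return (predicted_tag, decided).

    The example counts as decided only when exactly one rule applies;
    otherwise it is uncovered and receives the most frequent candidate tag.
    """
    fired = [tag for applies, tag in rules if applies(example)]
    if len(fired) == 1:
        return fired[0], True
    return most_frequent_tag, False
```

The second component of the result makes it possible to count decided and undecided test cases separately, which is how the results in Section 4 are reported.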

In the rest of this section we describe briefly the different representations we 
employed. The results are discussed in detail in Section 4.



3.1 C4.5 and AGLEARN 



In our experiments with C4.5 we represented the context of the focus words by 
fixed-length lists of tokens, with a window size of 7. Below we give
an item from the C4.5 training data file:

empty, empty, ms, asn, nso, nsnx, wpunct, nsn, vmis3s, asn, npox, aso, t, asn, t 

The first seven columns describe the left context, the next seven the right context, and the
last one is the true tag of the current token. If a context was shorter than 7 then
it was padded with the special value empty. In order to extend the above description
with some basic linguistic knowledge, we used the system AGLEARN [12].
AGLEARN is a learning method which extends a propositional learning table
with extra columns containing “relational” information. The basic idea of
AGLEARN is very similar to that of the ILP system LINUS [8], i.e. a relational learning
problem is transformed into propositional form and the rules inferred by a
propositional learner are transformed back into relational form. The difference
is that AGLEARN uses the attribute grammar formalism for the relational representation.
Since AGLEARN applies C4.5, its application can be considered as
an extension of the propositional language used by C4.5.
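
The fixed-width rows given to C4.5 can be produced along the following lines. The sketch assumes that both contexts are kept in sentence order, with padding at the far end of each context, which matches the sample row above; the function name `c45_row` is illustrative only.

```python
from typing import List


def c45_row(tags: List[str], i: int, width: int = 7, pad: str = "empty") -> List[str]:
    """Build one propositional row for the token at position i of a sentence.

    Layout: `width` left-context tags, `width` right-context tags, then the
    true tag of the focus token. Short contexts are padded with `pad`.
    """
    left = tags[max(0, i - width):i]
    right = tags[i + 1:i + 1 + width]
    left = [pad] * (width - len(left)) + left      # pad before the sentence start
    right = right + [pad] * (width - len(right))   # pad after the sentence end
    return left + right + [tags[i]]
```

With the default width of 7 each row has 15 columns, as in the example item.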

When setting up a learning task for AGLEARN, some simple linguistic background
knowledge was developed. The aim of this linguistic background knowledge
was to recognize token groups and characteristic phrase structures called
syntagmas in the context of the focus word. An attribute grammar was used
for the computation of two new attribute values (group and syntagma). Group
values were computed for the left and right contexts. The syntagma value was
computed only for the right context. This value is based on the syntax of the
context, so it cannot be computed for the left context when the beginning of the
syntactic structure is not known. AGLEARN generated training examples for
C4.5 in which, in addition to the CTAGs, these new attribute values appeared
as new columns. Our findings showed that these new columns in several
cases appeared in the learned rules [2].



3.2 Progol 

As mentioned earlier, for each ambiguity class $c = \{t_1, \ldots, t_k\}$, and for each
$t \in c$, a separate learning problem was considered over the target predicate
choose_t_from_c/2. For each training token with ambiguity class c we added
the positive (resp. negative) example choose_t_from_c(L, R) if its true tag is
t (resp. is not t). Arguments L and R are the token’s whole left (in reverse
order) and right contexts, respectively (i.e., they are lists containing the true
tags). In each Progol learning task we used the following “technical” background
knowledge:

$$\bigcup_{s \in C} \{s(s)\} \;\cup\; \{\mathit{empty}([\,]),\ \mathit{first}(T, [T \mid \_]),\ \mathit{second}(T, [\_, T \mid \_]),\ \mathit{third}(T, [\_, \_, T \mid \_])\},$$

where C denotes the set of CTAGs. Atoms first, second, and third were
used to select the i-th token from their second list arguments for i = 1, 2, 3,
respectively. Note that the above background knowledge defines a window size
of 3. Progol then learned rules e.g. of the form

choose_cp_from_cp_rp(L, R) :- first(L1, L), rg(L1), second(R2, R), nsn(R2),

meaning that a token’s true tag is predicted to be cp if its ambiguity class is {cp, rp},
its left neighbor is a token of true tag rg, and its second right neighbor has true 
tag nsn. 
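
To make the reading of such a rule concrete, the example clause can be emulated as below. This is a Python paraphrase for illustration, not Progol output; the helper `nth` stands in for the background predicates first, second and third, and contexts are assumed to be lists of true tags with the left context in reverse order.

```python
from typing import List, Optional


def nth(i: int, context: List[str]) -> Optional[str]:
    """Tag of the i-th token (1-based) in a context list, or None if too short."""
    return context[i - 1] if len(context) >= i else None


def choose_cp_from_cp_rp(left_reversed: List[str], right: List[str]) -> bool:
    """Emulates: first(L1, L), rg(L1), second(R2, R), nsn(R2)."""
    return nth(1, left_reversed) == "rg" and nth(2, right) == "nsn"
```

Because the left context is reversed, `nth(1, left_reversed)` is the immediate left neighbor of the focus token.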



3.3 PHM 

The Product Homomorphism Method (PHM) introduced in [16] is a combinatorial
approach for learning simple logic programs. By using PHM, positive
PAC learning results can be proved for different classes of structured background
knowledge, including the class of colored path graphs [16, 15, 14]. The POS
problem is basically a sequential problem that can be represented by colored
path graphs. As PHM does not require any bound on the depth, in contrast to
the other methods investigated, we do not need any window technique.

A colored directed graph is given by a tuple $G = (V, E, Q_1, \ldots, Q_l)$, where
$V$ denotes the set of vertices, $E \subseteq V \times V$ the set of edges, and $Q_i \subseteq V$ for
$i = 1, \ldots, l$ the colors of the vertices. Thus, a vertex may have more than one color or
possibly none. $G$ is a colored path graph if the directed graph $(V, E)$ is a set
of disjoint directed paths. Given a colored path graph background knowledge $B$
and a set of facts $E = \{P(b_1), \ldots, P(b_t)\}$, $b_i \in V$ for $i = 1, \ldots, t$, the reduced
RLGG of $E$ with respect to $B$ can be computed in time $O(l \cdot t \cdot T_{\max})$ and space
$O(l \cdot T_{\max})$, where $T_{\max}$ is the length of the longest path in $B$ [14].

In order to apply PHM, the POS problem must be represented using
colored path graph background knowledge. In our experiment with PHM
we used the background knowledge B defined by the training set as follows:

- a constant identifying the jth word in sentence i was introduced for every
  possible i, j,

- for every pair of consecutive words identified by a and b, respectively, a fact R(a, b)
  was added to B (hence, the ground R-atoms in B form a set of disjoint
  directed paths), and

- for each constant a, a fact t(a) was added to B, where t is the true tag of the
  word identified by a (hence, t can be considered as a color).
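
The construction of B can be sketched as follows. The constant naming scheme `w_i_j` and the returned data structures are assumptions made for this illustration; the source only states that one constant per word, the R facts and the tag facts are added.

```python
from typing import Dict, List, Set, Tuple


def build_background(sentences: List[List[str]]) -> Tuple[Set[Tuple[str, str]], Dict[str, Set[str]]]:
    """Build colored path graph background knowledge from tagged sentences.

    sentences[i][j] is the true tag of the j-th word of sentence i.
    Returns (R, colors): R holds the edges R(a, b) between consecutive words,
    colors maps each tag t to the set of constants a for which t(a) holds.
    """
    R: Set[Tuple[str, str]] = set()
    colors: Dict[str, Set[str]] = {}
    for i, sentence in enumerate(sentences):
        for j, tag in enumerate(sentence):
            a = f"w_{i}_{j}"                      # hypothetical constant name
            colors.setdefault(tag, set()).add(a)
            if j > 0:
                R.add((f"w_{i}_{j - 1}", a))      # edge from the previous word
    return R, colors
```

Since edges never cross sentence boundaries, each sentence contributes one directed path, so the R facts form a set of disjoint directed paths as required.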

For each ambiguity class $c = \{t_1, \ldots, t_u\}$ and for each possible tag $t \in c$
we considered a separate learning problem with background knowledge B
and training examples $E^{+}(c,t) \cup E^{-}(c,t)$ with respect to a target predicate
choose_t_from_c/1, i.e.,

- $E^{+}(c,t)$ is the set of the atoms choose_t_from_c(a) for every training
  instance (token) a with ambiguity class c and true tag t, and

- $E^{-}(c,t)$ is the set of the atoms choose_t_from_c(a) for every training
  instance a with ambiguity class c and true tag different from t.

As the problem of finding a consistent hypothesis consisting of k clauses with
the smallest k is NP-complete [14], we used a simple variant of tabu search [1]
with tabu size 5 maximizing the relative frequency function $n_{correct}/n$, where
$n_{correct}$ is the number of training examples correctly classified, and $n$ is the
number of the covered training examples. A clause discovered was accepted if
its accuracy on the covered training examples was at least 90%, and it covered
at least 10% of the positive examples.
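
The acceptance test for a discovered clause can be written down directly. The sketch assumes that, for a clause predicting tag t, the correctly classified covered examples are exactly the covered positive examples; the function and parameter names are illustrative.

```python
def accept_clause(covered_pos: int, covered_neg: int, total_pos: int,
                  min_accuracy: float = 0.9, min_coverage: float = 0.1) -> bool:
    """Accept a clause if n_correct / n >= 90% and it covers >= 10% of positives.

    covered_pos: covered positive examples (assumed to be the correct ones)
    covered_neg: covered negative examples
    total_pos:   all positive examples of the learning problem
    """
    covered = covered_pos + covered_neg
    if covered == 0 or total_pos == 0:
        return False
    accuracy = covered_pos / covered          # relative frequency n_correct / n
    coverage = covered_pos / total_pos        # share of positives covered
    return accuracy >= min_accuracy and coverage >= min_coverage
```

The tabu search itself is not shown; only the thresholds stated in the text are encoded here.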

3.4 RIBL 

RIBL is an instance-based ILP algorithm first introduced in [10] and further 
developed in [3]. Although there are a number of user-defined parameters con- 
trolling the automatic similarity measure used by RIBL, one of the basic practical 
problems is the choice of an adequate representation of the learning problem. In 
the case of the POS problem, we considered two different representations. 



Representation with Lists In this non-flattened representation we used a
ternary predicate for describing the background knowledge. Its first input argument
was used to identify instances (i.e., tag occurrences), and its second and
third output arguments, both of type list, were used to describe the left and right
neighbors (i.e., the true tags) of the instance identified by the first argument. In
both directions we considered only two words at most (i.e., their true tags). For
each ambiguity class $c = \{t_1, \ldots, t_k\}$ investigated we took the set of training
instances with ambiguity class c and for each instance x in this set we added a target atom
t(x) to the set of training instances, where t is the true tag of the instance identified
by x. In order to compare arguments of type list, we used list edit distance
supported directly by RIBL [3]. We used trivial cost [3] on the list alphabet (i.e.,
the set of possible tags).
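
As an illustration of the list comparison, the following is a standard edit distance over tag lists. It assumes that "trivial cost" means unit cost for every insertion, deletion and substitution of differing tags; this is not RIBL's own implementation.

```python
from typing import List


def list_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two tag lists with unit costs."""
    prev = list(range(len(b) + 1))            # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]
```

With contexts limited to two tags per side, the distance between two context lists is at most 2.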



Representation with Relations In this representation we used a binary background
relation R describing the chain of the words, i.e., the background knowledge
contains a fact R(a, b) if and only if the two instances identified by a and
b, respectively, are consecutive in this order. For each training instance a with
true tag t we also added the fact true_tag(a, t) to the background knowledge.

As in the previous representation, for each ambiguity class c we considered
a separate learning problem with the same training instances as in the previous
non-flattened representation.



4 Experimental Results and Discussion 

In this section we discuss the experimental results of the five learning systems
on the domain described in Subsection 2.2. Since AGLEARN used extra linguistic
background knowledge and is based on C4.5, the results with C4.5 and
AGLEARN are discussed together.

C4.5 and AGLEARN C4.5 failed to disambiguate only 15 test queries; in
other words, it made a decision in 99.4% of the 2443 ambiguous test queries.
It predicted 1973 instances correctly, which is an 80.8% test accuracy for the
whole test set containing 2443 instances. We also note that this corresponds to
a relative accuracy of 81.3% for the set of 2428 disambiguated test examples.

In the extended representation used by AGLEARN, the number of undecided 
cases was 72, i.e., it disambiguated 97.1% of the test instances. On the other 
hand, even for this smaller set, the number of correctly classified queries was 
2045, i.e., 83.7% of the test examples. This is an 86.3% relative accuracy with
respect to the decided cases.



Progol We got slightly different results with Progol. Progol could decide only 
in 1606 cases, that is, only in 65.7% of the test domain. A similar experimental 
result with Progol is also reported in [5]. On the other hand, Progol classified 
1482 examples correctly, i.e., 60.7% of the 2443 test instances. This is an excellent 
92.3% relative accuracy with respect to the decided 1606 cases. 



PHM The rules discovered by PHM disambiguated 1965 cases from the 2443 
ambiguous test queries, i.e., 80.4% of the 2443 test instances. 1776 cases, namely 
72.7% of the 2443 examples, were correctly classified, which corresponds to a 
relative accuracy of 90.4% for the disambiguated 1965 cases. 
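
The three figures quoted for each rule learner (share of decided cases, accuracy on all 2443 test instances, and relative accuracy on the decided cases) follow from the raw counts. The helper below is a small sketch for recomputing them; the function name and dictionary keys are illustrative.

```python
from typing import Dict


def summarize(total: int, decided: int, correct: int) -> Dict[str, float]:
    """Percentages, rounded to one decimal, from raw test-set counts."""
    return {
        "coverage": round(100 * decided / total, 1),           # decided / all
        "accuracy": round(100 * correct / total, 1),           # correct / all
        "relative_accuracy": round(100 * correct / decided, 1) # correct / decided
    }


c45 = summarize(total=2443, decided=2428, correct=1973)
progol = summarize(total=2443, decided=1606, correct=1482)
phm = summarize(total=2443, decided=1965, correct=1776)
```

The counts are those reported in the text: 2443 − 15 = 2428 decided cases for C4.5, 1606 for Progol, and 1965 for PHM.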



RIBL In the non-flattened representation the optimal number of neighbors
k = 11 was estimated by the leave-one-out evaluation method on the training
set. Since the true tags in the window of a test instance may be unknown, we 
introduced a special symbol. The cost of inserting, deleting or replacing this 
symbol with any symbol was set to 1. In contrast to the other four systems, 
RIBL did not leave undecided cases, and correctly classified 84.8% (i.e., 2072 
out of 2443) of the test set. 

The best training (as well as test) accuracy was achieved by depth bound 
3 in the second representation. In this case, RIBL correctly classified 1794 test 
instances out of the 2443, which is a 73.4% test set accuracy. This result shows
that a significant improvement on the accuracy can be gained by using lists with 
edit distance on this domain (see also [3]). 

In the remaining part of this section we summarize the above mentioned 
experimental results. For RIBL we take the result with the list representation. 
C4.5, AGLEARN, and RIBL have very good disambiguation capabilities, that 
is, the number of undecided test queries was small for these systems. In con- 
trast to RIBL, C4.5, and AGLEARN, the systems Progol and PHM could not 
disambiguate a significant part of the test queries. There was also a difference 
between Progol and PHM in this sense. On the other hand, Progol and PHM 
had the best relative accuracy for the decided cases, and Progol was better than 
PHM. We also note that although the extra linguistic background knowledge
used by AGLEARN increased the number of undecided cases, it improved the
results with C4.5.

In order to compare the previous results, we disambiguated the undecided 
test queries by applying the simple tagger based on lexicon frequencies. The 
accuracy of this tagger was 76.1% on the set containing the 2443 ambiguous 
test instances. The final results are listed in Table 2. The table also contains the 
per-word overall accuracies without punctuation symbols. In order to compute 
them, we used the lexicon frequencies for the uncovered 234 cases. The line 
’ambiguous’ in Table 2 illustrates the accuracies for the 2443 ambiguous test 
instances covered by the 27 classes investigated. 





                        C4.5          AGLEARN   Progol   PHM           RIBL (lists)
ambiguous (%)           81.0          84.8      82.8     84.6          84.8
overall per-word (%)    [illegible]   98.03     97.80    [illegible]   98.03

Table 2. Final test set accuracy results expressed in percentage terms.



4.1 Cascade Connection of the Taggers 

As mentioned earlier, our goal was not to compare the different learning systems, 
but to select the candidate components of a working combined tagger. As a first
step in this direction, we investigated some cascade connections of the previous
taggers. These combinations first employ systems which are very accurate but 
have low coverage (e.g. Progol and/or PHM) and fall back to RIBL (which 
leaves no uncovered instances) only for the undecided tokens. The results of 
some combinations investigated are given in Table 3 (the accuracy is computed 
with respect to the 2443 ambiguous test instances considered). 





                Progol → RIBL   PHM → RIBL   Progol → PHM → RIBL
accuracy (%)    85.7            86.5         86.0

Table 3. Results with different cascade connections of Progol, PHM, and RIBL.



Although the cascade connection is a simple combination in contrast to, e.g.,
[13], the results show the potential of this approach. An interesting observation
is that the cascade connection of Progol with PHM did not improve the accuracy 
which means that PHM has weaker decision capability on the undecided tokens 
left by the Progol tagger. 
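
A cascade connection of this kind can be sketched as below. It assumes that each high-precision tagger is a function returning a tag, or None when it cannot decide, and that the fallback (RIBL in the experiments) always returns a tag; these interfaces are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence, TypeVar

Example = TypeVar("Example")


def cascade(example: Example,
            taggers: Sequence[Callable[[Example], Optional[str]]],
            fallback: Callable[[Example], str]) -> str:
    """Try the taggers in order; use the fallback only for undecided tokens."""
    for tagger in taggers:
        tag = tagger(example)
        if tag is not None:
            return tag
    return fallback(example)
```

In this scheme the Progol → PHM → RIBL combination of Table 3 corresponds to `cascade(x, [progol, phm], ribl)`, where the three functions are placeholders for the trained taggers.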

5 Conclusion 

In this paper, we have presented an initial case study with five selected ILP and 
non-ILP algorithms on the problem of POS tagging for the Hungarian language.
The purpose of the experiments was not to compare the learning systems, but
rather to examine how the unusual and difficult properties of Hungarian (com- 
plex grammar, large tag set, free word order) affect the properties of the problem. 
Our experiments indicate that Hungarian is indeed a particular challenge to the 
learning of POS tagging rules, whether using propositional or ILP methods. 

More specifically, however, some generally observed properties of the problem 
seem to carry over from simpler languages to Hungarian. First, in [13], it was 
reported that memory-based methods (MBL) are particularly suited for learning 
POS taggers. Given the accuracy figures reported above, our experiments indi- 
cate that this might also be the case for the POS tagging problem in Hungarian: 
the instance-based ILP system RIBL shows good results on this domain both in 
coverage and accuracy aspects. Secondly, as already emphasized e.g. by Cussens
[5], the use of linguistic background knowledge is of great importance and cannot
be offset by the use of more powerful learners. In our experiments, even a
very simple set of linguistic background knowledge, provided by AGLEARN to
C4.5, has notably improved the learning results. Thirdly, as in the experiments
by Cussens for English [5], we found that also for the Hungarian POS tagging
problem, both rule-inducing ILP learners have very good accuracy in those cases
when they could decide. However, the number of uncovered cases is relatively high
for these learners, indicating some need for further generalization ability, e.g. by
giving additional background knowledge.

In future work on continuing this case study, we are therefore going to use 
additional linguistic background knowledge provided by an expert in order to 
improve accuracies of the three ILP systems Progol, PHM, and RIBL. For RIBL
in particular, this may also mean using an expert-provided editing cost matrix
instead of the trivial one used in the present experiments. In the cases of Progol
and PHM, a test query remained undecided if it was covered by no clause or
by two or more clauses predicting different tags. In order to reduce the number
of undecided instances, we are going to apply a Bayesian method for these cases.
Bayesian methods for POS tagging have also been investigated in [6]. For the
PHM approach, we plan improvements using local search. Finally, given the
encouraging results of [13] on taggers that combine different learning systems,
we are going to investigate other more sophisticated combinations as well.



Abstract. There is a history of research focussed on learning of shift-reduce
parsers from syntactically annotated corpora by the means of
machine learning techniques based on logic. The presence of lexical se- 
mantic tags in the treebank has proved useful for learning semantic con- 
straints which limit the amount of nondeterminism in the parsers. The 
level of generality of the semantic tags used is of direct importance to 
that task. We combine the ILP system Lapis with the lexical resource 
WordNet to learn parsers with semantic constraints. The generality of 
these constraints is automatically selected by Lapis from a number of op- 
tions provided by the corpus annotator. The performance of the parsers 
learned is evaluated on an original corpus also described in the article. 



1 Introduction 

In all but the simplest natural language corpora, the grammars used for syntactic 
annotation are ambiguous, i.e. for some sentences they generate more than one 
parse. When learning parsers from treebanks, that is, text corpora in which 
sentences are annotated with their parses, two of the central issues in the parser 
design are the reduction of the number of parses, and the choice of the best parse 
out of the remaining ones. The former issue can be addressed by adding new
constraints to, and otherwise modifying, the grammar used by the parser. If the result
is a deterministic grammar, then the latter issue mentioned is automatically 
settled. Otherwise, probabilistic techniques can be applied to assign a probability 
to each parse, and choose the most probable one. 

There is a history of research focussed on limiting or removing the nondeterminism
from shift-reduce parsers using machine learning techniques based on
logic [?, ?]. The latter work introduces the system Lapis in which lexical
semantic tags present in the treebank are used to learn semantic constraints limiting
the amount of nondeterminism in the LR parsers produced by the system.
The grain of the semantic tags used is of direct importance to that task. Too
general semantic tags would result in rules with poor discriminating ability, whereas
too specific ones would not permit sufficient coverage of unseen examples.
It is possible to use the lexical database WordNet [2] for semantic tagging, as
it provides several concepts of various generality for each sense of a given word.
The choice of a single concept out of those to be used as a semantic tag should
not be made by the corpus annotator, because, as Zelle remarks ([?], p. 32), “it
is unlikely [that] a hand-crafted feature set will be complete enough ... to make
the distinctions that are necessary for accurate parsing in realistic domains”.

The present work implements a technique previously outlined by the author [?],
allowing the use of WordNet in a flexible framework, where nouns and
verbs in the corpus are tagged with a list of all relevant tags, and for each word 
the tag which is optimal, i.e. best suited for disambiguation, is automatically 
selected by the learning system. The method is tested on an original corpus with 
suitable annotation. 

2 LR Parsers 

LR parsers belong to the family of shift-reduce parsers. The latter are bottom-up 
directional parsers which use a stack to keep the intermediary results. Two of the 
basic actions carried out by all shift-reduce parsers are to shift a symbol from 
the input stream onto the stack, and to reduce a number of symbols on top of 
the stack, i.e., replace them with another one. Characteristic of LR parsers
is that they cannot handle ambiguous grammars; on the other hand, they can
detect a syntactic error as soon as it is possible to do so on a left-to-right scan
of the input [?].

The LR parser stack always contains an odd number of elements. Starting 
from the bottom with the initial state of the parser, all odd slots contain indices 
of parser states. The even slots of the stack contain grammar symbols, either 
terminals or nonterminals (see Figures 1 and 2). The current state of the parser is
the one on top of the stack. Each internal state corresponds to a set of partially
processed grammar rules which share the RHS symbols seen so far. Each
subsequent action of the parser is selected from a parsing table according to the
current state of the stack and a number of lookahead input symbols (a single one
for the parsers studied here). LR parsing tables are normally created with the help
of a software tool, such as the well-known Yacc [?]. The attempt to construct a
parsing table for an ambiguous grammar results in a table containing conflicts,
i.e. for certain pairs of states and lookahead symbols there are several possible
actions in the table. Starting from the initial state 0, four actions are possible
for a given State and input Symbol.

Accept If State=EndState and Symbol=end_of_string, then halt. The input
has been successfully parsed (see Figure 1a).

Shift/NewState Push the current input symbol onto the stack and transit to
NewState, i.e. put NewState on top of the stack (see Figure 1b). NewState is
chosen from a parsing table, where it is selected by State and Symbol.

Reduce/RuleNo Remove from the top of the stack the number of couples
(state, symbol) corresponding to the number of symbols in the RHS of the
grammar rule Rule, check out the state OldState now on top of the stack,
push the LHS of Rule onto the stack, and transit to state NewState, i.e.
push it onto the stack (see Figure 2a). NewState is selected by OldState
and LHS in the part Goto of the LR parsing table.

Error Fail. The current state and lookahead symbol indicate that an ungrammatical
sentence is being processed.

For a given parsing table, the corresponding LR parser is easily represented as
a logic program. The representation described here is similar to the one used
by Zelle and Mooney [?]. The stack and input are represented as lists, and the
tagged words have the format Word:PoS^SemTag. Then the actions accept and
error can be represented as predicates of two arguments (Stack, Input) (see
Figure 3). The examples shown read as follows: “If in State 1 the lookahead input
symbol is tagged as end-of-sentence, accept the sentence and print the parse” (cf.
Figure 1a), “Announce an unsuccessful parse if in the initial state 0 the lookahead
symbol is not a verb, noun, or determiner”. The actions shift and reduce are
similarly represented as predicates with two pairs of arguments describing the
stack and input before and after the action. Figure 3 also shows a simplified
version of the parser as defined by the predicates parse/1 and do_parse/2.
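
The four actions can also be illustrated with a small table-driven driver. The sketch below uses a hypothetical one-rule grammar (S → n v) with a hand-written parsing table; the state numbers, the end marker "$" and the table contents are assumptions for illustration and are not taken from the paper. Unlike the Prolog lists of Figure 3, the stack is kept bottom-first here.

```python
from typing import Dict, List, Optional, Tuple

SHIFT, REDUCE, ACCEPT = "shift", "reduce", "accept"

# Hypothetical toy grammar: rule 1 is S -> n v.
RULES: Dict[int, Tuple[str, List[str]]] = {1: ("s", ["n", "v"])}
ACTION: Dict[Tuple[int, str], Tuple[str, Optional[int]]] = {
    (0, "n"): (SHIFT, 2),
    (2, "v"): (SHIFT, 3),
    (3, "$"): (REDUCE, 1),
    (1, "$"): (ACCEPT, None),
}
GOTO: Dict[Tuple[int, str], int] = {(0, "s"): 1}


def parse(tokens: List[str]) -> bool:
    """Return True if the token sequence is accepted by the toy LR table."""
    stack: list = [0]                    # states and symbols alternate; odd length
    buf = list(tokens) + ["$"]           # "$" marks the end of the input
    while True:
        state, lookahead = stack[-1], buf[0]
        entry = ACTION.get((state, lookahead))
        if entry is None:
            return False                 # Error: no action for (state, lookahead)
        kind, arg = entry
        if kind == ACCEPT:
            return True
        if kind == SHIFT:
            stack += [buf.pop(0), arg]   # push the symbol, then the new state
        else:
            lhs, rhs = RULES[arg]
            del stack[len(stack) - 2 * len(rhs):]    # pop |RHS| (symbol, state) pairs
            stack += [lhs, GOTO[(stack[-1], lhs)]]   # push LHS and the goto state
```

A conflict in a parsing table would show up here as more than one candidate entry for the same (state, lookahead) pair, which a dictionary cannot express; the Prolog representation handles such cases through backtracking.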



3 Related Work 

In his PhD thesis [?], Samuelsson describes a technique for the development
of an efficient LR parser based on Explanation-Based Learning (EBL) [?] and
entropy-related information measures. The method uses a treebank to modify the 
provided set of grammar rules and replace some of them with partially unfolded 
ones. The new rules do not cover part of the syntactic readings, if those are 
only marginally represented in the treebank. As a result, the grammar is less 
ambiguous at the price of a certain loss of coverage. Using that grammar also 
results in considerably faster parsing, due to the pruned search space and the 
larger number of symbols in the right-hand side of the unfolded grammar rules, 
which are reduced by the LR parser in a single step. 

Zelle and Mooney [?] have developed a method for learning case-role
grammars from a treebank in which the nonterminal nodes correspond to deep
cases, such as Agent, Patient, Instrument, etc. The treebank is used to construct
an overgeneral shift-reduce parser covering the sentences in the treebank. The
parser is then specialised and made deterministic by using ILP. Whenever necessary,
lexical semantic classes are automatically defined in order to achieve
deterministic parsing. In the experiments with case-role grammars, the learning
algorithm Chill “consistently invented interpretable word classes” [?], such as
animate, human, or food.



4 Lapis 

Lapis is a system which builds on Zelle and Mooney’s research on the induction
of shift-reduce parsers and extends it to learning of LR parsers, while changing
at the same time the focus of the learning task. Lapis, similarly to Chill,

[Figure: (a) accept, showing Stack and Input Buffer; (b) shift, showing Stack, Input Buffer, New Stack and New Input Buffer]

Fig. 1. LR parser accept and shift actions




[Figure: (a) LR parser reduce action, showing Stack, Input Buffer, New Stack and New Input Buffer; (b) generation of positive and negative examples of actions. Legend: internal parser state; end parser state; action ID (e.g. s8); state number; correct action (leading to the correct parse); wrong action (a step aside from the sequence of correct actions)]

Fig. 2. (a) LR parser reduce action; (b) Generation of positive and negative examples of actions

parse(ListOfTokens) :- do_parse([0], ListOfTokens).

%%% Main loop: accept, or tag the lookahead symbol and try a shift or a reduce.
do_parse(Stack, Input) :- accept(Stack, Input).
do_parse(Stack, Input) :-
    tagLookAheadSymbol(Input, TaggedI),
    ( action_s(Stack, TaggedI, NewSt, NewI)
    ; action_r(Stack, TaggedI, NewSt, NewI) ),
    do_parse(NewSt, NewI).

%%% accept(Stack, Input). Stack=[Top,...,Bottom]. Example:
accept([1, Parse | _], [_:e_o_s^_ | _]) :- write(Parse).

%%% error(Stack, Input). Input=[LookaheadSymbol|More]. Example:
error([0 | _], [_:AnyOf^_ | _]) :- \+ member(AnyOf, [v, det, n]).

%%% action_s(Stack, Input, NewStack, NewInput). Example of shift action:
action_s([0 | Stack], [n:Word^STag | Input], [2, n:Word^STag, 0 | Stack], Input).

%%% action_r(Stack, Input, NewStack, NewInput). Example (S --> n VP):
action_r([3, vp(VP), State1, n:Word^STag, State2 | RestOfStack], InputBuffer,
         [NewState, s([n:Word^STag, vp(VP)]), State2 | RestOfStack], InputBuffer) :-
    goto(State2, s, NewState).

%%%
goto(0, s, 1).

Fig. 3. Logic representation of LR parsers

[Figure: diagram of the Lapis system; the only recoverable label is the lexicon with entries Word, POS, SemTag]

Fig. 4. Lapis



uses lexical semantic classes as constraints for parse disambiguation. However,
the assumption made in Lapis is that semantic classes which are useful in disambiguation
are present in language as words and idioms. The existence of a
comprehensive source of such classes (WordNet), and even treebanks annotated
with those classes (parts of the Brown Corpus [?]), poses the question whether
WordNet semantic classes should not be used rather than learned from scratch.

The system Lapis constructs LR parsers from treebanks of parse trees annotated
with lexical semantic tags. Lapis aims at the reduction of nondeterminism
in the parsers it creates. This is achieved, as will be shown below, by
means of lexicalisation and partial unfolding of the underlying grammar rules,
in combination with the use of lexical semantic constraints.

The number of recursive grammar rules that are to be learned, and the lack 
of explicit negative examples make the learning of tools for syntactic analysis 
a particularly difficult task. In Lapis, parser actions are learned rather than 
grammar rules. Although grammar rules and parser actions are closely related, 
the former are recursive concepts, whereas the latter are not. 

For all trees in the treebank, each node and its immediate descendants represent
the LHS and RHS, respectively, of a CFG rule. The CFG derived from the treebank
is used in Lapis to define, for each sentence, the set of all syntactically correct
parses; some of the parses which are not present in the treebank are then
used as negative examples. Again, learning parser actions instead of grammar
rules makes it possible to reduce the number of examples by considering only partial parses
(subtrees) which differ from the correct parse in one node, the topmost so far.
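
Reading a CFG off a treebank, as described above, can be sketched as follows. The tree encoding (a pair of label and child list for internal nodes, a plain string for a leaf category) is an assumption made for this illustration and is not the treebank format used by Lapis.

```python
from typing import List, Optional, Set, Tuple, Union

Tree = Union[str, Tuple[str, List["Tree"]]]
CfgRule = Tuple[str, Tuple[str, ...]]


def cfg_rules(tree: Tree, rules: Optional[Set[CfgRule]] = None) -> Set[CfgRule]:
    """Collect one rule (LHS, RHS) per internal node of the parse tree."""
    if rules is None:
        rules = set()
    if isinstance(tree, tuple):
        label, children = tree
        # The RHS consists of the labels of the immediate descendants.
        rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
        rules.add((label, rhs))
        for child in children:
            cfg_rules(child, rules)
    return rules
```

Taking the union of the rule sets of all trees gives a grammar that generates every tree in the treebank, and usually many spurious parses as well.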

Lapis (Learning Algorithm for Parser Induction Systems) consists of the
following procedures (see Figure 4).

Step 1 For all trees in the treebank, the context-free grammar rules correspon- 
ding to all pairs <ParentNode,Daughters> are found. The result is a CFG 
which can produce all syntactic trees in the treebank. That CFG is usually 
ambiguous and produces a great number of spurious parses. 
Procedure Inductive learning of parser actions

Initialise SetOfGenCl := {e+_1}, where e+_1 is the first positive example
Initialise e+ := List of all positive examples but e+_1
Initialise e- := List of all negative examples
repeat
    Retract a positive example e+_i from e+
    repeat
        Get next clause NextGenAction from SetOfGenCl
        Find LGG := lgg(NextGenAction, e+_i)
    until ((LGG does not subsume any example from e-, NewLGGFound := true)
        or (there are no more clauses in SetOfGenCl, NewLGGFound := false))
    if NewLGGFound = true then
        Retract NextGenAction from SetOfGenCl
        Retract from SetOfGenCl all clauses subsumed by LGG
        Put LGG in SetOfGenCl
    else Add e+_i to SetOfGenCl
until e+ is empty
return SetOfGenCl



(a) Least general generalisations of parser actions
(b) Inductive learning of parser actions

Fig. 5. Step 4



Step 2 Using the CFG rules obtained, the parsing table of a possibly nonde- 
terministic LR parser is generated with Yacc. Each possible parser action 
(transition of that parser from one state to another) is represented as a 
Prolog clause. The logic program obtained can use the Prolog built-in back- 
tracking mechanism to deal with any nondeterminism in the parser, and thus 
produce for a given sentence all possible parses. 

Step 3 For each sentence in the treebank, the nondeterministic LR parser from 
Step 2 is used to generate the sequence of parser actions leading to the 
correct parse as shown in the treebank (Figure Eb). Each of these actions is
an instantiation of one of the clauses generated in Step 2. 

For each parser state in which nondeterminism occurs, e.g. State 7 in Fi- 
gure Eb, the instantiation of the correct parser action (r2 in the example) 
is stored as a positive instance of that action. All other possible instantiati- 
ons of parser actions for the given state and input symbol (here r5 for any 
input) are also generated and stored as negative instances of those actions 
(a detailed explanation with examples can be found in [3]).

Step 4 In the last step, the positive and negative instances of each parser action 
are used as input of a greedy ILP learning algorithm (shown in Figure 5b)
based on syntactic least general generalisation 0 to learn a new definition 
of that action. The new definition is a set of clauses subsuming all positive 
examples, but none of the negative ones (see Figure 5), and is usually more
specific than the CFG-based definition of the action generated in Step 2. If the
former definition is substituted for the latter, the new parser obtained usually
contains fewer nondeterministic states and hence does not produce some of
the spurious parses generated by the CFG-based parser from Step 2.
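
The greedy covering loop of Step 4 can be sketched in a few lines. The sketch below is only an illustration under strong simplifying assumptions that are not part of Lapis: every instance of a parser action is flattened to a fixed-length tuple of constants, a generalised clause uses None for a variable, and the position-wise lgg ignores variable sharing, which a full syntactic LGG would preserve. The names lgg, subsumes and learn_action are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# A clause is a fixed-length tuple; None stands for an unconstrained variable.
Clause = Tuple[Optional[str], ...]


def lgg(a: Clause, b: Clause) -> Clause:
    """Position-wise generalisation: keep equal constants, else a variable."""
    return tuple(x if x == y else None for x, y in zip(a, b))


def subsumes(general: Clause, specific: Clause) -> bool:
    """True if every constant of `general` occurs at the same position."""
    return all(g is None or g == s for g, s in zip(general, specific))


def learn_action(positives: Sequence[Clause],
                 negatives: Sequence[Clause]) -> List[Clause]:
    """Greedy cover: generalise while no negative instance is subsumed."""
    clauses: List[Clause] = [positives[0]]
    for example in positives[1:]:
        for candidate in clauses:
            general = lgg(candidate, example)
            if not any(subsumes(general, neg) for neg in negatives):
                # Replace the candidate and everything the LGG subsumes.
                clauses = [c for c in clauses if not subsumes(general, c)]
                clauses.append(general)
                break
        else:
            # No consistent generalisation exists: keep the example itself.
            clauses.append(example)
    return clauses
```

With two positive instances that differ only in the word and a third that differs in its part of speech, the first two are merged into one clause, while the third is kept apart because merging it would cover the negative instance.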

The new parser can be viewed as using a unification grammar based on the
backbone CFG of the parser from Step 2, but with rules additionally specialised
(their coverage decreased). The difference between the CFG rules and those
learned is threefold.

(a) Mapping from word-forms and lexical entries to synsets and their hypernyms
in WordNet
(b) Grammar rule and partial unfolding, illustrated on the sentence ‘John eats
pasta with cheese’ (S: noun verb noun PP; PP: prep noun)

Fig. 6.



1. Where the CFG rule is only looking for a terminal with a particular part 
of speech (e.g. noun), the new rule additionally requires that the terminal 
should belong to some semantic class. 

2. In the new rules, a PoS can be narrowed down to a single word. In other 
words, the rules learned are partially lexicalised. 

3. Some of the nonterminals in the RHS of the new rules are unfolded, i.e.
replaced by some of their descendants in the parse tree, as shown in Figure 6b.

It can be seen from the rules learned that Lapis combines rule unfolding with 
the use of lexical semantic tags to disambiguate the grammar. That, along with 
the partial lexicalisation makes the new grammar rules more specific than the 
original CFG ones, so that they produce a lower number of spurious parses. 



5 WordNet as a Source of Semantic Tags 

WordNet is an on-line lexical database which contains syntactic and semantic 
information for a large number of words and idioms. Originally developed for 
English P], WordNet is also being implemented for other languages. The central 
building element of WordNet is called synset, or lexicographer’s entry. A synset 
is a set of words or idioms which share a common meaning, for instance {(to)
shut, (to) close}. To simplify the internal representation, each synset is
assigned a large integer used as a unique identifier. For instance, {monetary
resource, funds} is Synset 109616555 in WordNet 1.6. WordNet uses a set of rules
and lists of exceptions to map word-forms to all relevant lexical entries.
Figure 6a shows the word-form ‘funds’, which is recognised by WordNet as
corresponding to two lexical entries, ‘fund’ and ‘funds’. The lexical entry
‘fund’ appears in three synsets: {store, fund}, {fund, monetary fund}, and
{investment company, fund}. The lexical entry ‘funds’ only appears in the synset
{monetary resource, funds}. WordNet describes several semantic relations between
synsets, such as meronymy (PART-OF), hypernymy and hyponymy. The latter two are
identical up to the order of their arguments, and represent the two directions
of the IS-A relationship, i.e. they bind a synset to a more general one: ‘medium
of exchange’ is the hypernym of ‘money’, and ‘money’ is a hyponym of ‘medium of
exchange’. Once the mappings Word-form → Lexical entry and Lexical entry →
Synset are carried out, WordNet semantic relations can be queried. Figure 6a
displays all hypernyms possibly related to the word-form ‘funds’. WordNet also
contains a similar hierarchy of hypernyms for the verbs. In terms of graph
theory, each of these two hierarchies can be described as a directed acyclic
graph.
contains a similar hierarchy of hypernyms for the verbs. In terms of graph theory, 
each of these two hierarchies can be described as a directed acyclic graph. 

In the context of corpus-based parser learning, where the correct grain of
semantic classes is difficult to foresee, such a hierarchy of semantic tags
can provide the necessary material for a flexible choice of the most appropriate
semantic tags to be embedded in the parser. The idea is to allow the learning
algorithm to choose the appropriate level of generality automatically.

That goal can be achieved by using lists as lexical semantic tags of the nouns
and verbs in the treebank. Each noun or verb is tagged with a single list of
hypernyms, from the most general to the most specific, ending with the WordNet
synset representing the meaning of that word in the given context. Most of the
synsets (word senses) have one immediate hypernym; otherwise, one of the
hypernyms is usually more relevant to the context, e.g. person → life form vs.
person → causal agent, and it is selected by the annotator. Typical examples of
words so tagged follow; to improve readability, synset indices have been replaced
with one of the synonyms to which they correspond.



%Word:PoS^SemanticTag                                             pos/neg example
chauffeur:n^[entity, causal_agent, operator, driver, chauffeur]   (+)
driver:n^[entity, causal_agent, operator, driver]                 (+)
friend:n^[entity, causal_agent, person, friend]                   (-)

Imagine the words so tagged appearing in otherwise identical instantiations of
the same parser action, with the first two examples being positive and the last
one negative. The generalisation step of the ILP algorithm will result in the
following constraint for that terminal:
Word:n^[entity, causal_agent, operator, driver|_].
Note that the internal representation of lists as binary trees makes possible
the least general generalisation of lists of different length. In the current
implementation of Lapis, the most specific semantic tag which covers a number of
positive examples, and is consistent with the negative ones, is learned. In
general, there are a number of semantic tags covering the same positive and none
of the negative examples, e.g. operator and driver in our example. It is
possible to modify the algorithm according to the NLP application, so that the
most general of those semantic tags is kept instead of the most specific one.
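
The generalisation of two hypernym paths in the example above can be mimicked as follows. This is a minimal sketch, assuming that the paths only need to be compared from the most general synset downwards: it returns the longest common prefix together with a flag saying whether the tail is left open (the Prolog pattern [Prefix|_]). A true LGG of two lists can also keep agreeing elements after the first mismatch, which this simplification drops. The names lgg_path and covers are hypothetical.

```python
from typing import List, Tuple

Pattern = Tuple[List[str], bool]   # (common prefix, open tail?)


def lgg_path(p: List[str], q: List[str]) -> Pattern:
    """Longest common prefix of two top-down hypernym paths."""
    prefix: List[str] = []
    for a, b in zip(p, q):
        if a != b:
            break
        prefix.append(a)
    # The tail is open when at least one path continues past the prefix.
    open_tail = len(prefix) < max(len(p), len(q))
    return prefix, open_tail


def covers(pattern: Pattern, path: List[str]) -> bool:
    """True if `path` matches the prefix (and ends there if the tail is closed)."""
    prefix, open_tail = pattern
    if open_tail:
        return path[:len(prefix)] == prefix
    return path == prefix


chauffeur = ["entity", "causal_agent", "operator", "driver", "chauffeur"]
driver = ["entity", "causal_agent", "operator", "driver"]
friend = ["entity", "causal_agent", "person", "friend"]

pattern = lgg_path(chauffeur, driver)
```

Here pattern corresponds to [entity, causal_agent, operator, driver|_]: it covers both positive examples and rejects the path for friend.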
6 Dataset

The lexicon Lex5000, used in the experiments with structured semantic tags,
consists of the following files and predicates:

1. The files noun.lex and verb.lex contain the predicates nlx/2 and vlx/2
respectively, which map noun and verb word-forms into lexical entries. For
instance: vlx(beginning, begin).

2. The files noun_path_i.pl and verb_path_i.pl contain the predicates npath/2
and vpath/2 respectively, which map a lexical entry into one of the possible
semantic tags. For the sake of brevity, the original synset indices used in
WordNet have been replaced by shorter, yet unique identifiers, e.g.
vpath(begin, [i948,i949,i950]).

3. The file i_map.pl contains a single predicate, imap/2, which maps the
identifiers used in the predicates npath/2 and vpath/2 to the corresponding
synset indices in WordNet 1.6, e.g. imap(i10, '100011937').

4. The file ambig_lex.pl contains a mapping from word-forms to lexical entries
for the words other than nouns and verbs, along with a single top-level
predicate dict(WordForm,PoS,TopDownPath) serving as a uniform interface
to the whole lexicon.
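
The way such a top-level lookup could chain the word-form and path predicates is sketched below for verbs only. This is an assumption about how the interface might compose the two mappings, not a description of the actual lexicon code: the tables VLX and VPATH stand in for vlx/2 and vpath/2 and contain only the facts quoted above, the PoS code "v" is invented for the illustration, and the generator imitates Prolog backtracking over alternative paths.

```python
from typing import Iterator, List, Tuple

# Toy stand-ins for vlx/2 and vpath/2; only these facts appear in the text.
VLX: List[Tuple[str, str]] = [("beginning", "begin")]
VPATH: List[Tuple[str, List[str]]] = [("begin", ["i948", "i949", "i950"])]


def dict_lookup(word_form: str, pos: str) -> Iterator[List[str]]:
    """Enumerate every top-down path for a verb word-form."""
    if pos != "v":          # hypothetical PoS code; other classes omitted
        return
    for form, entry in VLX:
        if form != word_form:
            continue
        for lexical_entry, path in VPATH:
            if lexical_entry == entry:
                yield path


paths = list(dict_lookup("beginning", "v"))
```

A word-form with several lexical entries or several senses would simply yield several paths, which is the source of the semantic ambiguity discussed in the evaluation.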

The ratio between word-forms, lexical entries and semantic tags (paths of hy- 
pernyms) is 73 : 74 : 265 for the nouns, and 26 : 27 : 183 for the verbs. There are 
84 different word-forms in the whole lexicon, and 468 combinations of a word- 
form, PoS and semantic tag, which corresponds to an average of 5.57 tags per 
word-form. 

The treebank T5000 is an artificial resource, generated from 12 sentence
templates and the lexicon Lex5000. The treebank contains 5046 sentences and
their parse trees. All words are tagged with their PoS, and nouns and verbs
are also semantically tagged. Although the words in Lex5000 can have more
than one semantic tag assigned, the treebank is generated with the help of an
additional lexicon (disamb_lex_i.pl), from which only the correct semantic tag
is taken for each lexical slot in the templates. The treebank consists of
clauses of the predicate parse(Sentence, Tree). Each of the templates has a
unique parse tree assigned to the sentences it generates, except for one, which
is treated as ambiguous. For each of the sentences corresponding to that
template, there is a pair of parse/2 clauses sharing the same first argument,
i.e. referring to the same sentence, but assigning a different parse tree to it.

7 Results and Evaluation 

In the experiments, the treebank T5000 has been split into a training set
containing 4036 pairs <sentence, parse tree>, and a test set containing 1010
such pairs. Samples of different size have subsequently been taken from the
training set, and Lapis applied to them. For each training set sample, two
different parsers were produced: (1) the CFG-based parser produced in Step 2 of
the Lapis algorithm; (2) the specialised parser learned in Step 4. Two
modifications of the latter were considered: (2a) the parser as is, and (2b) the
parser in which the lexical semantic tags were not used as semantic constraints,
i.e. it only differed from the CFG-based parser in its use of partially unfolded
and lexicalised rules.

^ The structure of the lexicon benefitted from the joint work with James Cussens
on a related task.

The performance of these three parsers was compared on the test data set for
two outputs: the first parse, and all parses produced for each sentence. The two
main criteria used for evaluation were accuracy and precision. If c is the
number of correct parses, w the number of wrong ones, and u the number of
examples for which no parse is generated, then accuracy is defined as
$acc = \frac{c}{c+w+u}$, and precision as $prec = \frac{c}{c+w}$. Additionally,
we measured the percentage of sentences for which a parse was generated,
$cov = \frac{c+w}{c+w+u}$.
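
The three measures can be computed directly from the counts c, w and u. The sketch below assumes the conventional readings acc = c/(c+w+u), prec = c/(c+w) and cov = (c+w)/(c+w+u); the formulas did not survive in the extracted text, so these definitions, the function name parser_scores and the counts in the usage line are assumptions made for illustration only.

```python
from typing import Dict


def parser_scores(c: int, w: int, u: int) -> Dict[str, float]:
    """Accuracy, precision and coverage from correct/wrong/unparsed counts."""
    total = c + w + u
    parsed = c + w
    if total == 0:
        raise ValueError("at least one test sentence is required")
    return {
        "acc": c / total,
        # Precision is undefined when nothing was parsed; report 0.0 here.
        "prec": c / parsed if parsed else 0.0,
        "cov": parsed / total,
    }


scores = parser_scores(c=90, w=5, u=5)   # made-up counts
```

Under these definitions acc = prec * cov, which is consistent with the observation that a stricter parser can gain precision while losing accuracy through lower coverage.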

In the experiments where all possible parses were produced, the number of
parses produced for each sentence, #p, and the number of correct parses per
sentence, corr, were also considered. The results averaged over 5 trials are
displayed in Table 1 and Figure 7. Table 2 shows the average times (over 4
trials) for the most time-consuming Lapis Steps 3-4, and for parsing itself. The
time complexity of Steps 3-4 is almost linear w.r.t. the treebank size. Parsing
without semantic constraints takes about 15-25 ms to find the first parse, and
31-56 ms to find all of them. Parsing with semantic constraints is slower for
two reasons: the additional lexicon lookup, and the amount of backtracking that
semantic ambiguity causes. To avoid backtracking, the parser was specially
designed to postpone semantic tagging until all semantic constraints have been
applied. However, the search for all parses using semantic constraints may
become very time-consuming because of the high level of lexical semantic
ambiguity present in the corpus. For that reason, no results have been reported
on the average performance of parsing with semantic constraints for all parses.

Comparison between the performance of the CFG-based parser and the one
learned with Lapis is strongly in favour of the latter. Using lexical semantic
constraints in the specialised parser decreases accuracy (by an average of 0.57%
in the first-parse-only case), as the number of covered sentences goes down. At
the same time, precision is increased (by 1.14% on average). The limited lexical
repertoire of the treebank makes it possible to resolve many syntactic
ambiguities by lexicalisation of the grammar. With more realistic data, the role
of semantic classes is expected to increase.

Even when the semantic constraints are not applied, the average number of
parses produced for each sentence, 1.12 < #p < 1.42, compared with the average
of 15.12 parses per sentence produced by the CFG-based parser, shows a reduction
in the number of parses by one order of magnitude. For the two extreme cases
when #p = 1.42, an average of 1.22 (1.23) correct parses per sentence are
actually produced, due to the presence in the corpus of some semantically
ambiguous sentences, and probably also because some of the specialised clauses
representing parser actions partially overlap, thus producing the same parse
tree more than once.
Fig. 7. Comparison of the three parsers

Table 2. Time (in sec.) required for Steps 3-4 of Lapis, and for parsing.

8 Conclusions

The combination of Lapis and WordNet proves efficient for the learning of
specialised parsers containing a very limited amount of nondeterminism. The
dataset used, a treebank of semantically annotated parse trees, along with the
semantic lexicon WordNet, leads to the construction of parsers based on
grammars with partially lexicalised and/or unfolded rules, with the additional
help of lexical semantic constraints. The use of an artificial treebank was
justified for two reasons. Firstly, the annotation of that part of the Brown
Corpus which is tagged with WordNet synsets comes in pairs of files, one for the
syntactic trees and one for the semantic tags. Discrepancies between the two
files make merging them a nontrivial task. Secondly, we wanted to test the
impact of semantic constraints on noise-free data. Handling of noise is an issue
in its own right, which is still to be addressed in Lapis. A promising future
line of research is to use treebanks with all nodes annotated with their
semantics in logic form, with the possible result being a parser which only
produces parse trees corresponding to valid semantic interpretations of the
whole sentence.
Abstract. In our previous work we introduced a hybrid, GA&ILP-based
approach for learning of stem-suffix segmentation rules from an unmarked
list of words. Evaluation of the method was made difficult by the lack
of word corpora annotated with their morphological segmentation. Here
the hybrid approach is evaluated indirectly, on the task of tag prediction.
A pair of stem-tag and suffix-tag lexicons is obtained by the application
of that approach to an annotated lexicon of word-tag pairs. The two
lexicons are then used to predict the tags of unseen words in two ways,
(1) by using only the stem and suffix generated by the segmentation
rules, and (2) for all matching combinations of stem and suffix present
in the lexicons. The results show high correlation between the constituents
generated by the segmentation rules, and the tags of the words in
which they appear, thereby demonstrating the linguistic relevance of the
segmentations produced by the hybrid approach.



1 Introduction 

Word segmentation is an important subtask of natural language processing with 
a range of applications from hyphenation to more detailed morphological analysis 
and text-to-speech conversion. In our previous work [5] we introduced a hybrid,
GA&ILP-based approach for learning of stem-suffix segmentation rules from 
an unmarked list of words. Evaluation of the method was made difficult by the 
lack of word corpora annotated with their morphological segmentation. Here the 
quality of the segmentation rules learned with the hybrid approach is assessed, 
indirectly, through the task of morphosyntactic tag prediction. 

Tag prediction of unknown words is an important preprocessing step performed
by taggers. However, taggers currently employ simple heuristics for tag
prediction, based either on the majority class tag or on word affixes. In this
paper we show that word segmentation information can be exploited to predict



S. Dzeroski and P. Flach (Eds.): ILP-99, LNAI 1634, pp. 1 fill 1999. 
© Springer- Verlag Berlin Heidelberg 1999 



Learning Word Segmentation Rules for Tag Prediction 153 



the possible tags of unknown words with high accuracy. There are several
methods which would not require the segmentation of training words to learn tag
prediction rules from tagged lexicons. The tag prediction task is employed here
to demonstrate the close correlation between morphosyntactic tags and the word
segments produced by our rules. Success in this task would also imply the
possibility of using those segments as a substitute for morphosyntactic tags
when learning NLP tools from unannotated corpora.

An advantage of our approach is that it does not require a presegmented 
corpus for training. Instead, the system can be trained by supplying it with the 
same kind of lexicon of word-tag pairs as the one used in taggers. 

In our previous work, we have described the hybrid approach combining unsu- 
pervised and supervised learning techniques for generation of word segmentation 
rules from a list of words. A bias for word segmentation ^ is reformulated as 
the fitness function of a simple genetic algorithm, which is used to search for the 
word list segmentation that corresponds to the best bias value. In the second 
phase, the list of segmented words obtained from the genetic algorithm is used as 
an input for Clog 0 , a first-order decision list learning algorithm. The result is 
a logic program in a decision list representation that can be used for segmenta- 
tion of unseen words. Here an annotated lexicon of word- forms is used to assign 
morphosyntactic tags (or descriptions, MSDs) to each of the segments, and so 
build two annotated lexicons of stems and endings. The result is interpreted as 
a generative word grammar. The pertinence of this grammar is evaluated on 
the task of MSD prediction for unseen words, with and without the additional 
constraint of a single-word segmentation generated by the decision list learned 
in the previous step. 



2 Overview of GA&ILP Learning of Segmentation Rules 

This section provides a brief review of our hybrid GA&ILP approach, and for 
more details the reader should consult the paper in which the approach was first 
introduced 1^. 



2.1 Naive Theory of Morphology as Word Segmentation Bias 

Given a list of words segmented into stem-suffix pairs one can construct a pair 
of lexicons consisting of all stems, and suffixes, respectively. 

The Naive Theory of Morphology (NTM) bias ^ prefers segmentations which 
reduce the total number of characters N in the stem and suffix lexicons. The bias 
is based on the hypothesis that substrings composed out of real morphemes occur 
in the words with a frequency higher than any other left or right substrings 0 
In that way, a theory with a low N would produce lexicons where ‘stems’ and
‘suffixes’ correspond very often to single morphemes or their concatenation.
Since the word list can be stored as a list of pairs of indices <stem, suffix>
along with the two lexicons, the bias described can be seen as using Occam’s
razor to choose the simplest theory corresponding to the given dataset.

^ This presumption is limited to the languages in which the main operator used
to combine morphemes is concatenation.
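
As a concrete illustration of the bias, N can be computed by collecting the distinct stems and suffixes and summing their lengths. The sketch below assumes that a segmentation is given as a list of (stem, suffix) string pairs; the function name ntm_bias and the English toy words are hypothetical and serve only to show that sharing morphemes lowers N.

```python
from typing import Iterable, Tuple


def ntm_bias(segmented: Iterable[Tuple[str, str]]) -> int:
    """Total number of characters in the stem lexicon and the suffix lexicon."""
    pairs = list(segmented)
    stems = {stem for stem, _ in pairs}          # each stem counted once
    suffixes = {suffix for _, suffix in pairs}   # each suffix counted once
    return sum(len(s) for s in stems) + sum(len(s) for s in suffixes)


shared = [("walk", "ed"), ("walk", "ing"), ("talk", "ed"), ("talk", "ing")]
unsplit = [("walked", ""), ("walking", ""), ("talked", ""), ("talking", "")]
```

For these toy lists the morpheme-like segmentation gives N = 13, whereas leaving the words unsplit gives N = 26, so the bias prefers the former.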

2.2 Genetic Algorithms 

Genetic algorithms (GA) Pj are often used as an alternative approach to tasks 
with a large search space and multiple local maxima. A GA maintains a set of 
candidate solutions called individuals and applies the natural selection opera- 
tors of crossover and mutation to generate, usually in several iterations, new 
candidate solutions from existing ones. A fitness function is employed to rank 
the individuals to determine their goodness. The individuals are represented as 
a sequence of characters of a given, often binary, alphabet. The crossover opera- 
tion constructs two new child individuals by splicing two parent individuals at 
n points. The mutation operator creates a new individual from a single parent 
by randomly changing one of its characters. Individuals are mutated according 
to some mutation probability known as mutation rate. 

The following algorithm, known as a simple genetic algorithm, has been used 
for the purposes of this research. 

Procedure simple genetic algorithm

1. Initialisation
   a) Create a random population of candidate solutions
      (individuals) of size popsize.
   b) Evaluate all individuals using the fitness function.
   c) Store the best evaluated individual as best-ever individual.
   d) Set the number of generations to NG.

2. Generation and Selection
   For NG generations repeat:
   a) Sample the individuals according to their fitness, so that
      in the resulting mating pool those with higher fitness
      appear repeatedly with a higher probability.
   b) Apply crossover with probability crossover-rate.
   c) Apply mutation with probability mutation-rate.
   d) Evaluate all individuals using the fitness function.
   e) Update the best-ever individual.

3. Provide the best-ever individual as a solution.

2.3 GA Search for Best NTM 

The genetic algorithm described above is used to search the space of possible
segmentations of a given list of words and find a segmentation that is minimal
with respect to the NTM bias.

The representation of the list of segmented words in the GA framework is
straightforward. The position of the boundary between stem and suffix in a word
is represented by an integer, equal to the number of characters in the stem.
The segmentation of a list of words is represented as a vector of integers (see
Figure 1). A randomly generated population of such segmentations is used as a
starting point of the GA search. The crossover operator constructs two new child
chromosomes by splicing two parent chromosomes at one point. The mutation
operator modifies the split position of a single word either by incrementing or
decrementing the corresponding integer by one, or by changing it randomly
within the word length.

Fig. 1. Representing the segmentation of a list of words as a vector of
integers, and creating lexicons of stems and suffixes
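
The search can be sketched as a small program operating on vectors of split positions. This is an illustrative sketch and not the authors' implementation: the selection weights 1/(1 + bias), the equal chance of the two mutation variants, the default parameter values and all function names are assumptions, and popsize is assumed to be even so that the mating pool pairs up exactly.

```python
import random
from typing import List, Sequence, Tuple


def bias(words: Sequence[str], cuts: Sequence[int]) -> int:
    """NTM bias: total characters in the stem and suffix lexicons."""
    stems = {w[:k] for w, k in zip(words, cuts)}
    suffixes = {w[k:] for w, k in zip(words, cuts)}
    return sum(map(len, stems)) + sum(map(len, suffixes))


def crossover(a: List[int], b: List[int], point: int) -> Tuple[List[int], List[int]]:
    """One-point crossover: the two parents swap their tails."""
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(cuts: List[int], words: Sequence[str], rng: random.Random) -> List[int]:
    """Shift one split by +/-1, or redraw it anywhere within the word."""
    i = rng.randrange(len(cuts))
    child = list(cuts)
    if rng.random() < 0.5:
        child[i] += rng.choice((-1, 1))
    else:
        child[i] = rng.randint(0, len(words[i]))
    child[i] = max(0, min(len(words[i]), child[i]))   # stay inside the word
    return child


def ga_segment(words: Sequence[str], popsize: int = 30, generations: int = 200,
               crossover_rate: float = 0.6, mutation_rate: float = 0.3,
               seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    pop = [[rng.randint(0, len(w)) for w in words] for _ in range(popsize)]
    best = min(pop, key=lambda c: bias(words, c))
    for _ in range(generations):
        # Lower bias means higher fitness, hence a larger sampling weight.
        weights = [1.0 / (1 + bias(words, c)) for c in pop]
        pool = rng.choices(pop, weights=weights, k=popsize)
        nxt: List[List[int]] = []
        for a, b in zip(pool[::2], pool[1::2]):
            if len(words) > 1 and rng.random() < crossover_rate:
                a, b = crossover(a, b, rng.randrange(1, len(words)))
            nxt += [a, b]
        pop = [mutate(c, words, rng) if rng.random() < mutation_rate else c
               for c in nxt]
        best = min(pop + [best], key=lambda c: bias(words, c))   # best-ever
    return best
```

Calling ga_segment on a short word list returns one split position per word; the result is only near-optimal in general, which is why the rules learned afterwards can still improve the bias value.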

2.4 Segmentation Rule Learning Using Clog 

The GA produces for a given list of words their segmentation, along with a pair 
of lexicons of stems and suffixes. Typically for GAs, the segmentation is only 
near-optimal w.r.t. the bias, i.e. the change of some segmentations would result 
in a better bias value. The reusability of the GA output for unseen words is 
limited. Indeed, one could use the lexicons to segment an unseen word into a 
stem and suffix present in the lexicons. However, if there is more than one such 
segmentation, there is no way to choose among them. In the hybrid GA&ILP 
approach, the ILP system Clog is applied to learn segmentation rules which 
produce better segmentations than the GA alone, and can be used to find the 
best segmentation of unseen words. 

Clog |0| is a system for learning of first-order decision lists. Clog can learn 
from positive examples only using the output completeness assumption 0, only 
considering generalisations that are relevant to an example. In the current imple- 
mentation these generalisations are supplied by a user-defined predicate which 
takes as input an example and generates a hard-coded list of generalisations that 
cover that example. The gain function currently used in Clog is user-defined. 
For the segmentation problem we chose the following simple gain function: gain 
= QP - SN - C where QP denotes the number of new examples covered posi- 
tively, SN denotes the number of previously covered examples that are covered 
negatively and C is the number of literals in the clause body. 

The words segmented with the GA are represented as clauses of the predicate
seg(W,P,S), for instance: seg([a,n,t,o,n,i,m,i,h],[a,n,t,o,n,i,m],[i,h]).
Then seg/3 is used as a target predicate with mode seg(+,?,?) of the inductive
learning algorithm. Also, the predicate append/3 is used as intensional
background knowledge, and the range of the theory constants is limited to the
set of stems and suffixes that appear in the predicate seg/3. In fact, either of
the last two arguments of seg/3 is redundant, and the stem (the second argument)
was omitted in the real input dataset, which resulted in the format
seg([a,n,t,o,n,i,m,i,h],[i,h]).
The result of ILP learning is an ordered list of rules (non-ground logic clauses)
preceded by a list of exceptions represented as ground facts, such as the
following example: seg([a,n,t,o,n,i,m,a],[i,m,a]). The exceptions do not have
any impact on the segmentation of unseen words, and they are removed from
the decision list. In most cases, exceptions correspond to imperfectly segmented
words. When the segmentation rules, with the exceptions removed, are applied
to the GA input list of words, the result is, in general, a segmentation with a
better bias value. Figure 2 summarises schematically the GA&ILP approach.

Fig. 2. GA&ILP word segmentation setting

3 Dataset 

For our experiments we used part of the lexicon of the Slovene language created
within the EU Copernicus project MULTEXT-East [2]. The project developed
a multi-lingual corpus of text and speech data, covering six languages, including
Slovene, and lexical resources covering the corpus data. The Slovene lexicon
contains the full inflectional paradigms for over 15,000 lemmas; it has over half
a million entries, where each entry gives the word-form, its lemma and
morphosyntactic description (MSD). These descriptions are constructed according
to the MULTEXT-East grammar, which follows international recommendations and is
harmonised for seven languages. The MSDs contain all morphosyntactic features
which are relevant to a given PoS. For the 7 parts of speech actually represented
in the data used here, the number of features is as follows: Noun-8 features,
Verb-9, Adjective-9, Pronoun-13, Adverb-3, Numeral-10, Abbreviation-1.
The MULTEXT-East Slovene lexicon is freely available for research purposes.
It includes a lexicon of neologisms from the Slovene translation of Orwell’s
1984. From that lexicon, we used a list of 4383 different word-forms, i.e. with
homonyms represented only once. The list was split at random into a training
set of 3506 words and a test set containing 877 words. Next, the training set
was divided into disjoint lists of 100 words each and the genetic algorithm
was run separately on each of them. Then those lists were merged again. This
technique proves to be a feasible trade-off between the input data size and the
time needed by the GA to find a segmentation of high quality. Task decomposition
also makes the GA time complexity linear w.r.t. the input size. Indeed, if
T is the time required to run the GA on a list of M words for a certain number
of generations, then the time to apply the GA to a list of K * M words will
be approximately K * T if the data set is divided into K separate chunks and
the GA applied separately to each of them. The described decomposition also
allows the individual GA runs to be executed in parallel.

The list of segmented words so obtained was used as input of Clog. As
a result, a first-order decision list was learned. The decision list contained 736
exceptions (Figure 3), and 242 rules (Figure 4). Only the rules were applied for
the segmentation of the training list of words (cf. Figure 2), thus obtaining a
segmentation with a better bias value. 



%seg(Word,Suffix).
seg(['C',r,k,o,s,t,a,v,s,k,e],[k,o,s,t,a,v,s,k,e]) :- !.
seg(['C',i,t,a,m,o],[i,t,a,m,o]) :- !.
seg(['C',i,t,a,l],['C',i,t,a,l]) :- !.



Fig. 3. Sample of decision list exceptions 



Up to this point, no use whatsoever was made of the MSDs in the lexicon.
In the final stage of data preparation, all MSDs which can be assigned to each
word in the list were retrieved from the lexicon, and the 4-tuples
word-stem-suffix-possible MSD were stored as clauses of the predicate
msd_train/4:
msd_train([a,d,o,p,t,i,v,n,i],[a,d,o,p,t,i,v,n],[i],['A',f,p,m,s,a,y,-,n]).
There are 10477 such clauses, i.e. each word in the training set corresponds on
average to 3 morphosyntactic descriptions. The latter are very detailed, ranging
from part of speech to up to 12 more features. The MSDs are represented as lists
of one-letter constants, where the lists for a given PoS have the same length.

Similarly, the 877 words in the test set were annotated with all possible MSDs,
producing as a result 2531 word-tag pairs, e.g.
tagged_lex(['C',e,s,t,e,m,u],['A',f,p,m,s,d]).

Given the predicate msd_train/4, the following two lexicons can be generated:

^ For technical reasons, the characters č, š, and ž were replaced with ‘C’, ‘S’,
and ‘Z’ respectively.
seg(A,B) :- append([s,u,p,e,r],B,A), !.
seg(A,B) :- append([s,u,g,e,r,i,r,a,j],B,A), !.
seg(A,B) :- append([s,u,g,e,r,i,r,a],B,A), !.
seg(A,[i]) :- append(_,[i],A),
              append(_,[i,t,e,t,i],A), !.
seg(A,B) :- append([p,o,z,i,v,a,j],B,A),
            append(_,[a],A), !.
seg(A,[]) :- append(_,[],A),
             append(_,[u,j,o,'C'],A), !.
seg(A,[u]) :- append(_,[u],A),
              append(_,[j,u],A), !.
seg(A,[e,m,a]) :- append(_,[e,m,a],A), !.
seg(A,[e,m,u]) :- append(_,[e,m,u],A), !.
seg(A,[i,t,a]) :- append(_,[i,t,a],A), !.
seg(A,[i,m,o]) :- append(_,[i,m,o],A), !.



Fig. 4. Sample of segmentation rules 
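
The way such a decision list is applied can be mirrored outside Prolog: the rules are tried in order and the first one that fires determines the suffix, which is the effect of the cut in each clause. The sketch below re-expresses a few of the rules from Fig. 4 as Python closures over plain strings; the helper names are hypothetical, and most of the test words are made up for illustration rather than taken from the lexicon.

```python
from typing import Callable, List, Optional

# A rule returns the suffix it assigns, or None if it does not fire.
Rule = Callable[[str], Optional[str]]


def prefix_rule(stem: str) -> Rule:
    """Counterpart of: seg(A,B) :- append(Stem,B,A), !."""
    return lambda w: w[len(stem):] if w.startswith(stem) else None


def suffix_rule(suffix: str, ending: str) -> Rule:
    """Counterpart of: seg(A,Suf) :- append(_,Suf,A), append(_,Ending,A), !."""
    return lambda w: suffix if w.endswith(suffix) and w.endswith(ending) else None


RULES: List[Rule] = [
    prefix_rule("super"),
    prefix_rule("sugeriraj"),
    prefix_rule("sugerira"),
    suffix_rule("i", "iteti"),
    suffix_rule("ema", "ema"),
    suffix_rule("emu", "emu"),
]


def segment(word: str) -> Optional[str]:
    """Return the suffix chosen by the first matching rule, if any."""
    for rule in RULES:
        suffix = rule(word)
        if suffix is not None:
            return suffix
    return None
```

The ordering matters: a word starting with sugeriraj is handled by the longer stem rule before the shorter stem sugerira is ever tried.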



— stem-tag lexicon containing all stem-MSD pairs contained in msd_train/4,
e.g. stem_tag([a,d,o,p,t,i,v,n],['A',f,p,m,s,a,y,-,n]).

— suffix-tag lexicon containing all suffix-MSD pairs contained in msd_train/4,
e.g. suffix_tag([i],['A',f,p,m,s,a,y,-,n]).

4 Tag Prediction 

4.1 Method 

The tag prediction task is to predict for a given word W the set of all possible
MSDs (i.e. tags). This task is broken down into the following stages:

Segmentation Split W into stem Stm and suffix Suf by either method:
  Method 1 using the segmentation rules generated by Clog
  Method 2 splitting the word into all possible stem-suffix pairs such that
    both the stem and suffix can be found in the stem-tag and suffix-tag
    lexicons.

Tag prediction Given a segmented word W = Stm + Suf, the prediction of the
set of tags assigned to W is mainly based on the MSDs in the suffix-tag lexicon which
match the suffix. When the stem is present in the stem-tag lexicon, it is used
as an additional constraint, limiting the number of MSD candidates.

  PoS matching Produce the set of all MSDs selected by the suffix in the
    suffix-tag lexicon, such that for each of them an MSD with the same PoS is
    selected by the stem in the stem-tag lexicon, i.e.:
    stem_tag(Stm, [PoS|_]), suffix_tag(Suf, MSD), MSD = [PoS|_].
  Suffix-based If the previous step produces an empty set of MSDs, then the
    constraint imposed by the stem is dropped, and the set of tags generated is the set of all
    MSDs matching the suffix in the suffix-tag lexicon: suffix_tag(Suf, MSD).
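The two prediction steps and segmentation Method 2 can be sketched as below. This is an illustrative reading of the description above, assuming the lexicons are dictionaries from strings to sets of MSD tuples whose first element is the PoS. The function names and the toy lexicon entries (other than the adjective MSD quoted earlier) are hypothetical.

```python
def segmentations(word, stem_tag, suffix_tag):
    """Method 2: all splits whose stem and suffix both occur in the lexicons."""
    return [(word[:i], word[i:]) for i in range(len(word) + 1)
            if word[:i] in stem_tag and word[i:] in suffix_tag]

def predict_tags(stem, suffix, stem_tag, suffix_tag):
    """PoS matching first; fall back to all suffix MSDs if that yields nothing."""
    candidates = suffix_tag.get(suffix, set())
    stem_pos = {msd[0] for msd in stem_tag.get(stem, set())}
    matched = {msd for msd in candidates if msd[0] in stem_pos}
    return matched if matched else set(candidates)

# Toy lexicons: the noun MSD is invented for illustration only.
adj = ("A", "f", "p", "m", "s", "a", "y", "-", "n")
noun = ("N", "c", "m", "p", "n")
stem_tag = {"adoptivn": {adj}}
suffix_tag = {"i": {adj, noun}}

splits = segmentations("adoptivni", stem_tag, suffix_tag)
known_stem = predict_tags("adoptivn", "i", stem_tag, suffix_tag)
unknown_stem = predict_tags("xyz", "i", stem_tag, suffix_tag)
```

With a known stem only the adjective reading survives, whereas an unknown stem leaves the stem constraint empty and every MSD of the suffix is returned.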
4.2 Evaluation

Variants of the well-known statistics precision and recall have been employed
to evaluate the performance of our approach to tag prediction. First, some
metric must be adopted for the comparison between predicted
and correct MSDs. The simplest, binary yes/no metric, which would count only
perfect matches, was considered too strict for the comparison of MSDs with
up to 13 features. Instead, we employ a finer-grained metric based on the similarity
$sim(MSD_1, MSD_2)$ of two MSDs:

$$sim(MSD_1, MSD_2) = \#\,\text{identical features} \qquad (1)$$

We extend this to the similarity between an MSD and a tag-set (i.e. a set of
MSDs):

$$sim(MSD, TagSet) = \max_{MSD_i \in TagSet} sim(MSD, MSD_i) \qquad (2)$$

For a given word W, let the set of correct tags (MSDs) be $TagSet_c$, and the
predicted set of tags $TagSet_p$. This similarity measure is incorporated in our
definitions of precision and recall in the following way.

Let L denote the test set of words.

Let E denote the set of $(TagSet_c, TagSet_p)$ (correct-predicted) tagset pairs for every
word W in L.

For any tagset $TagSet$, let $|TagSet|$ denote the total number of features in
$TagSet$.

We define precision and recall as follows:

$$precision(E) = \frac{\sum_{(TagSet_c, TagSet_p) \in E} \; \sum_{MSD \in TagSet_p} sim(MSD, TagSet_c)}{\sum_{(TagSet_c, TagSet_p) \in E} |TagSet_p|}$$

$$recall(E) = \frac{\sum_{(TagSet_c, TagSet_p) \in E} \; \sum_{MSD \in TagSet_c} sim(MSD, TagSet_p)}{\sum_{(TagSet_c, TagSet_p) \in E} |TagSet_c|}$$



In other words, precision shows how closely the predicted tags match the gold
standard. To compute precision, for each of the predicted MSDs the best match
(the most similar MSD) is found in the set of correct tags; the similarities
of all predicted tags are then summed and divided by the total number
of features in the predicted tags. Similarly, recall shows how well the correct tags are represented
in the set of predicted tags.
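The measures can be sketched in a few lines. This is a minimal illustration under stated assumptions: MSDs are tuples of one-letter features compared position by position, every list element (including '-') counts as a feature, and E is a list of (correct, predicted) pairs of sets. The function names and the toy tag sets are hypothetical.

```python
def sim(msd1, msd2):
    """Number of positions at which the two MSDs carry the same feature."""
    return sum(1 for a, b in zip(msd1, msd2) if a == b)

def sim_to_set(msd, tag_set):
    """Similarity of an MSD to its best match in a tag set (0 for an empty set)."""
    return max((sim(msd, other) for other in tag_set), default=0)

def n_features(tag_set):
    """Total number of features in a tag set."""
    return sum(len(msd) for msd in tag_set)

def precision(pairs):
    num = sum(sim_to_set(m, correct) for correct, predicted in pairs for m in predicted)
    den = sum(n_features(predicted) for _correct, predicted in pairs)
    return num / den if den else 0.0

def recall(pairs):
    num = sum(sim_to_set(m, predicted) for correct, predicted in pairs for m in correct)
    den = sum(n_features(correct) for correct, _predicted in pairs)
    return num / den if den else 0.0

# Toy data: one word, one correct MSD, two predicted MSDs.
correct = {("A", "f", "p")}
predicted = {("A", "f", "p"), ("A", "m", "s")}
pairs = [(correct, predicted)]
p, r = precision(pairs), recall(pairs)
```

In the toy case the correct MSD is fully recovered, so recall is 1, while the extra predicted MSD shares only one of its three features with the gold standard and lowers precision to 4/6.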



4.3 Results 

Using Segmentation Method 1 The segmentation rules learned with Clog
were applied to the test data. Out of 877 different words in the test set, 858
or 97.83% were covered by the rules. For the successfully segmented words,
predictions were made for the set of corresponding MSDs. For each word in the
test set, the set of predicted MSDs was on average 3.89 times larger than the
set of correct MSDs (the set of correct MSDs contains in total 21742 features,
as opposed to 84612 in the MSDs predicted). The figures for precision (82.35%)
and recall (91.69%), as defined in Section 4.2, can be interpreted as follows:

1. Precision can be seen as the correctness of the predictions made, whereas
   recall quantifies the ability to produce as many of the correct tags as possible.
   The result precision < recall means that our approach performs better on
   the latter task than on the former one, i.e. it is slightly over-general: it is more
   careful not to reject a correct MSD than not to generate an incorrect one.
   So, the metric used has a simple and plausible interpretation.

2. A high percentage of the MSDs predicted have a close match in the set
   of correct MSDs. Since $|MSD_p| > |MSD_c|$, that also means that many of
   the predicted MSDs are very similar to each other, differing only in a small
   percentage of features.



Using Segmentation Method 2 The results for this experiment are as follows: 
precision = 48.28%, recall = 99.50%. 

5 Conclusions 

This article introduces a method using a lexicon annotated with morphosyntactic
features to learn rules for the prediction of those features for unseen words.
The article also demonstrates the strength of the hybrid GA&ILP approach in
learning segmentation rules from unannotated words. The main advantages of
the approach are threefold. Firstly, the lexicons of stems and suffixes produced
by the learned segmentation rules can reliably capture the information relevant
to the morphosyntactic tags of a word. This can be seen from the 99.50% recall for
segmentation method 2, where all combinations of stems and suffixes with
matching MSDs were used to predict word tags. Secondly, the
hybrid approach learns rules assigning a single segmentation to each
covered word. The additional information that this segmentation brings to the
tag prediction task is reflected in the considerable increase in precision
(82.35% for method 1 as opposed to 48.28% for method 2). Finally,
the hybrid approach only requires a relatively small list of unannotated words
(10^-10^ as compared to the annotated corpora of 10® words used by Brill P) 
to learn segmentation rules, which can be used either for the segmentation of 
the words used for learning, or to segment unseen words. The unsupervised fra- 
mework makes the application of the hybrid approach to word segmentation 
a possible way to apply corpus-based NLP methods requiring morphosyntactic 
tags to unannotated corpora. As the word constituents produced by the hybrid 
approach are closely related to the morphosyntactic features of the words, tag- 
ging the words in a corpus with their constituents produced by the segmentation 
rules could serve as a substitute for missing morphosyntactic tags. 
Abstract. This paper presents an application of Inductive Logic Programming
(ILP) and Backpropagation Neural Network (BNN) to the problem of Thai
character recognition. In such a learning problem, there exist several different
classes of examples; there are 77 different Thai characters. Using examples
constructed from character images, ILP learns 77 rules, each of which defines
one character. However, some unseen character images, especially noisy
images, may not exactly match any learned rule, i.e., they may not be covered
by any rule. Therefore, a method for approximating the rule that best matches
the unseen data is needed. Here we employ BNN for finding such rules.
Experimental results on noisy data show that the accuracy of rules learned by
ILP without the help of BNN is comparable to other methods. Furthermore,
combining BNN with ILP yields a significant improvement and surpasses the
other methods tested in our experiment.



1 Introduction 

Inductive Logic Programming (ILP) has been successfully applied to real-world tasks,
such as drug design [11], traffic problem detection [4], etc. This paper presents an
application of ILP to the task of Thai printed character recognition. Although this task
has been widely researched for many years and there are some commercial products
of Thai character recognition software available, the accuracies are not yet as high as
those for English. This is due to the fact that Thai characters are comparatively more
complex and some characters are similar to others. Various approaches to Thai
character recognition have been proposed, such as the method of comparing the head of
characters [6], backpropagation neural network [8,15], the method of combining
fuzzy logic and syntactic method [13], the method of using cavity features [12], etc.

The reason for choosing ILP for the task of Thai printed character recognition is that
ILP is able to employ domain-specific background knowledge, which leads to better
generalization performance. However, some problems arise when ILP is
applied to our task, where there are several classes of examples, i.e., 77 different Thai
characters. Most ILP systems work with two classes of examples (positive and
negative), and construct a set of rules for the positive class. Any example not covered
by the rules is classified as negative. If we want to employ these systems to learn a
multi-class concept, we can do so by first constructing a set of rules for the first
class with its examples as positive and the other examples as negative, then
constructing the sets of rules for the other classes by the same process. The learned rules
are then used to classify future data, and the rule that covers or exactly matches the
data is selected as the output. One major problem of this method is that some test
data, especially noisy images in our task, may not be covered by any rule. In that case the
method is unable to determine the correct rule. Dzeroski et al. [5] solved this problem
by assigning the majority class recorded from the training data to any test example that is not
exactly matched by any rule.

We approach this problem directly by proposing a method to approximate the rule 
that provides the best match with the data. Here, we employ BNN for the 
approximation of ILP rules. First we tried to approximate the rule by using the 
number of nonmatching literals (literals whose truth values are false) and the number 
of matching literals (literals whose truth values are true) as the training input vector to 
the BNN. We found that the method is able to find the approximate rule that gives a 
higher accuracy than that of ILP alone. However, our first method did not take into 
account the significance or the weight of each literal. We then redesigned the 
structure of BNN to consider the weight of each literal. In our second method, instead 
of the numbers of nonmatching and matching literals, we use the truth values of all 
literals in the rules as the input vector, and design a new structure of BNN that can 
give a weight to each literal separately. Experimental results show that the recognition
accuracy of the new structure is further improved, and is higher than those of the
other methods tested in our experiment.



2 Feature Extraction 

In the literature on Thai printed character recognition, the method of combining fuzzy
logic and syntactic method (FLS) reported very high accuracy and was shown to be one of the
most successful methods [13]. Therefore, in the following experiment FLS is
compared with our methods, and for the purpose of comparison the feature
extraction of FLS is employed.

After a character image is preprocessed by noise reduction and thinning 
processing, its features are extracted. Basically, the extracted feature is represented as 
a primitive vector list. A primitive vector list is composed of primitive vectors each of 
which represents the type of lines or circles in the original image. Therefore, the list 
represents the structure of lines and circles, and shows how these lines and circles are 
connected to form the character. Each primitive vector is defined to be one of the 13
types shown in Figure 1.

The primitive vectors of types 0 to 7 represent lines and those of types 8 to 12
represent circles. The primitive vector of type 0 is a line whose angle is between 1
and 45 degrees, that of type 1 is a line whose angle is between 46 and 90 degrees, and so on. The
primitive vector of type 8 is a circle that is not connected to any line. The primitive
vector of type 9 is a circle that is connected to a line in quadrant 1, etc.
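The mapping from line angle to primitive vector type can be written as a one-line computation. The sketch below assumes that angles are integer degrees in the range 1 to 360 and that "and so on" continues in 45-degree steps up to type 7. The function name line_type is hypothetical, and the circle types 8 to 12 are not covered.

```python
def line_type(angle_degrees):
    """Map an integer line angle (1..360 degrees) to a primitive vector type 0..7."""
    if not 1 <= angle_degrees <= 360:
        raise ValueError("angle must be between 1 and 360 degrees")
    # 1..45 -> 0, 46..90 -> 1, ..., 316..360 -> 7
    return (angle_degrees - 1) // 45

types = [line_type(a) for a in (1, 45, 46, 90, 360)]
```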

Fig. 1. Primitive Vectors

In addition to the primitive vectors, other features are also extracted, such as the
starting and ending zones of a primitive vector, the level, the ratio of the width to
the height of the character, the list of zones containing junctions of lines, the list of zones
containing '∧' angles, and the list of zones containing '∨' angles. The level
indicates the character position. Thai characters are written in 3 levels, i.e., level 1,
level 2 and level 3. For example, in Figure 2, the word “m(shrimp)” consists of four
characters in the three levels; ‘ ^ ’, ‘n’ and are in the level 1, the level 2, the
level 3 and the level 3, respectively. The '∧' and '∨' angles are helpful in
distinguishing between a pair of characters that are similar except for the angle; one of them
contains such an angle and the other does not, such as (‘fi’,‘n’), etc.

Fig. 2. Three levels of Thai writing system



An instance of the extracted features of the character image ‘fi’ is shown in
Figure 3. This character image contains 8 primitive vectors, numbered
0, 1, ..., 7 in the figure. For instance, vector no. 0 is of type 1, contains
an end point (a point not connected to another vector, indicated by -1), has its
starting zone in zone 3 and its ending zone in zone 2. The image contains '∧' angles
in zone 1.



3 Applying Progol to Thai Character Recognition 

The ILP system used in our experiment is Progol [10]. The inputs to Progol are
positive examples, negative examples and background knowledge. The output is a set
of rules defining the positive examples. The following subsections explain the inputs
to and the output from Progol.

Fig. 3. An instance of features extracted from a character image.



3.1 Example Representation & Background Knowledge 

An example for Progol is the set of extracted features explained in the previous section.
It is the same as the one used by FLS, except that it is converted to a logical
representation of the following form:

char(A,B,C,D,E,F)

where char is a character, A is the level of the character, B is the ratio of the width to
the height, C is the list of numbers each of which represents a primitive vector together with
whether or not the primitive vector contains an end point, the starting zone and the
ending zone of the vector, D is the list of zones that contain junctions of lines, E is the
list of zones that contain '∨' angles, and F is the list of zones that contain
'∧' angles. To ease the writing of background knowledge definitions, we map a
tuple “(vectorType, endPoint, startingZone, endingZone)” to a number “VectorID”,
and use it in C. Figure 4 shows some examples of the character ‘n’.

n(3,0.78,[155,83,467,211,82,970,970,781],[],[],[z1]).
n(3,0.76,[155,339,339,211,83,83,978,970,781],[],[],[z2]).
fi(3,0.72,[155,467,467,211,211,83,82,970,842,845,805,19],[z2],[z2],[z1]).
fi(3,0.75,[155,467,211,83,83,978,842,842,714,653,19],[z2],[z2],[z2]).
fi(3,0.75,[155,467,211,83,82,970,845,677,19],[z2],[z2],[z1]).
fi(3,0.78,[155,467,211,83,82,970,970,781,83,979,19],[z2],[z2,z2],[z2,z1]).

Fig. 4. Examples of character ‘fl’

The Thai character set consists of 77 different characters (‘fi’, ‘ni’, ‘”u’,..., ‘g’). We
choose two types of fonts, i.e., Cordia and Eucrosia, each of which contains 7
different sizes (20, 22, 24, 28, 32, 36 and 48 points). Therefore, the number of examples
is 77 × 2 × 7 = 1078.

B. Kijsirikul and S. Sinthupinyo

We have constructed background knowledge that describes our knowledge about
the domain of Thai character recognition. Appropriate background knowledge
helps Progol produce more accurate rules. For example, if we believe that the zone
of the head of the character can be used to discriminate between the characters, we
add a predicate, such as head_zone(ListOfPrimitiveVector, InZone),
that examines this characteristic. The background knowledge used in our experiment
contains 55 predicates. Some of them are shown in Figure 5. Note that some
background predicates are complicated and defined by recursive programs. This is
one reason for using Progol in our task.



head_zone(A,B) :- head(A,C), startZone(C,B).
head_primitive(A,B) :- head(A,C), primitive(C,B).
count_primitive_type4([],0).
count_primitive_type4([A|B],C) :-
    primitive(A,4), count_primitive_type4(B,D), C is D+1.
count_primitive_type4([A|B],C) :-
    not primitive(A,4), count_primitive_type4(B,C).
v_angle_at_head(A) :- member(z2,A).
endpoint_primitive(A,B) :-
    member(C,A), endpoint(C,-1), primitive(C,B).
circle_at_endpoint_in_zone(A,B) :-
    member(C,A), isCircleEndpoint(C), startZone(C,B).
count_circle_at_endpoint([],0).
count_circle_at_endpoint([A|B],C) :-
    isCircleEndpoint(A), count_circle_at_endpoint(B,D), C is D+1.
count_circle_at_endpoint([A|B],C) :-
    not isCircleEndpoint(A), count_circle_at_endpoint(B,C).
count_startZone5([],0).
count_startZone5([A|B],C) :-
    startZone(A,5), count_startZone5(B,D), C is D+1.
count_startZone5([A|B],C) :-
    not startZone(A,5), count_startZone5(B,C).
right_line(A) :-
    member(B,A), endPoint(B,-1), endZone(B,1), primitive(B,1).
right_line(A) :-
    member(B,A), endPoint(B,-1), endZone(B,1), primitive(B,2).
member_zone(A,B,C) :- member(E,A), startZone(E,B), endZone(E,C).
head_primitive_type9or10(A) :- head_primitive(A,9).
head_primitive_type9or10(A) :- head_primitive(A,10).
head_primitive_type10or11(A) :- head_primitive(A,10).
head_primitive_type10or11(A) :- head_primitive(A,11).

Fig. 5. Some of the background knowledge used in the experiment.

3.2 The Output

We first train Progol to induce the rule for character ‘n’ using the 14 examples of ‘n’ as
positive examples and the rest (1,064 examples) as negative examples. The training
process is then repeated to construct the rules for all other characters. The output of
Progol is a set of 77 rules that are used to classify future character images. Each
rule defines the characteristics of one character. Note that we did not force Progol to
learn one rule per class; it simply happened to produce exactly one rule for
each class.
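The one-class-against-the-rest construction of the training sets can be sketched as follows. This is only an illustration of the data split, not of Progol itself: the function name one_vs_rest is hypothetical, and the placeholder labels and integer features stand in for the 77 characters and their extracted features.

```python
def one_vs_rest(examples):
    """Build one (positives, negatives) learning task per class label.

    examples is a list of (label, features) pairs.
    """
    tasks = {}
    for label in sorted({lab for lab, _ in examples}):
        positives = [feat for lab, feat in examples if lab == label]
        negatives = [feat for lab, feat in examples if lab != label]
        tasks[label] = (positives, negatives)
    return tasks

# Placeholder data: 77 classes x 2 fonts x 7 sizes = 1078 examples.
examples = [(f"char_{c}", (c, k)) for c in range(77) for k in range(14)]
tasks = one_vs_rest(examples)
```

Each of the 77 tasks then has 14 positive and 1,064 negative examples, in agreement with the figures quoted above.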

Figure 6 shows some of the learned rules. For instance, the first rule in Figure 6
defines the character ‘n’; an input character image will be recognized as the character
‘fi’ if that character is in level 3, has its head in zone 3, the head of the character is a
primitive vector of type 1, and the number of primitive vectors of type 4 is equal to 0.
These rules are compared with the features of a future character image. The
rule that exactly matches the image is selected as the output. However, in the case of
a noisy image or unseen data, a character image may not exactly match any rule in the
rule set. In the next section, we describe methods that use BNN to
approximately match the rules with the character image.
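Exact matching, and the way it fails on noisy input, can be illustrated with a small sketch. It assumes that each rule is represented as a list of boolean tests (literals) on a feature dictionary. The single rule below loosely mirrors the first rule of Figure 6, and the names classify_exact, char_1 and the dictionary keys are hypothetical.

```python
def classify_exact(example, rules):
    """Return the label of the first rule whose literals all hold, else None."""
    for label, literals in rules.items():
        if all(literal(example) for literal in literals):
            return label
    return None

# One toy rule: level 3, head in zone 3, head primitive of type 1,
# and no primitive vector of type 4.
rules = {
    "char_1": [
        lambda e: e["level"] == 3,
        lambda e: e["head_zone"] == 3,
        lambda e: e["head_primitive"] == 1,
        lambda e: e["type4_count"] == 0,
    ],
}

clean = {"level": 3, "head_zone": 3, "head_primitive": 1, "type4_count": 0}
noisy = dict(clean, type4_count=1)  # one literal fails because of noise

clean_label = classify_exact(clean, rules)
noisy_label = classify_exact(noisy, rules)
```

The noisy example satisfies three of the four literals but is still left unclassified, which is the situation the approximate matching of the next section is meant to handle.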



n(3,A,B,C,D,E) :-
    head_zone(B,3), head_primitive(B,1),
    count_primitive_type4(B,0).
ni(3,A,B,C,D,E) :-
    not v_angle_at_head(D), head_zone(B,2),
    endpoint_primitive(B,1),
    circle_at_endpoint_in_zone(B,2),
    count_circle_at_endpoint(B,1),
    right_line(B), member_zone(B,4,1),
    head_primitive_type9or10(B),
    count_startZone5(B,0).
»U(3,A,B,C,D,E) :-
    head_zone(B,2), head_primitive(B,10),
    endpoint_zone(B,4),
    begin_and_endzone(B,2,1),
    member_zone(B,2,1), member_zone(B,4,1),
    have_member(B,7,2,2),
    head_primitive_type9or10(B),
    head_primitive_type10or11(B).
fl(3,A,B,[z3],C,D) :-
    not A_angle([z3]), not v_angle(C),
    head_zone(B,0), endpoint_primitive(B,6).

Fig. 6. Some rules learned by Progol.

4 Approximate ILP rules by BNN

Backpropagation neural network (BNN) [14] is widely applied to various recognition 
tasks. In this paper, BNNs are employed to select the rule that closely matches with 
the input image. The following subsections explain two types of BNNs, i.e., (1) BNN 
that uses the number of nonmatching literals and the number of matching literals of 
the rules as the input vector, and (2) BNN that uses the truth values of literals of the 
rules as the input vector. 



4.1 Using Numbers of Nonmatching and Matching Literals (1st BNN) 

There can be many ways to approximate the rule that best matches the data. A simple
choice is based on the assumption that the best rule should be the rule that contains a
small number of nonmatching literals but a large number of matching literals. We
therefore first build a simple BNN to test this idea.

The structure of the first BNN is composed of three layers: (1) the input layer
consisting of 154 neurons, (2) the hidden layer consisting of 154 neurons, and (3) the
output layer consisting of 77 neurons. All neurons receive real values. The hidden and
output neurons use the sigmoid function. The links from the input neurons to the
hidden neurons, and from the hidden neurons to the output neurons are fully
connected, as shown in Figure 7.

As described above, the training set in our experiment consists of 77 different 
characters (‘fi’, ‘ni’, ‘'ll’, ‘fi’,...,‘g’). Each character has the corresponding rule 
produced by Progol. In the training process of BNN, for each training example, we 
evaluate the number of nonmatching and number of matching literals by comparing 
the example with all rules in the rule set. Therefore, 154 numbers from all 77 rules in 
the rule set are used as the training input vector for BNN. The hidden layer consists of 
154 neurons, simply the same as the number of the input neurons. The output layer is 
composed of 77 neurons, each of which represents a character. When the BNN is 
trained by an example of the character ‘n’, the first output neuron that corresponds to 
the character is activated to 1 and the other neurons are set to 0. In the recognition 
process, the neuron with the maximum activation will be turned on. 

Figure 7 illustrates the input and output vectors of an example of character ‘fi’. 
First, the numbers of nonmatching and matching literals of the rule of character ‘fi’ 
are counted, that in our experiment are 0 and 4, respectively. Then the example is 
examined with the rule of the character ‘ni’, and numbers of nonmatching and 
matching literals are 5 and 5. After the example is checked with all 77 rules, the 
numbers of nonmatching and matching literals are then fed into the neural network as
an input vector of the character ‘n’. In the training mode, the first neuron in the output 
layer that corresponds to character ‘n’ is set to 1. The training process is then repeated 
for the other examples of character ‘n’ as well as the examples of the other characters. 
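The construction of the input and target vectors for the first BNN can be sketched as below. The sketch assumes that each rule is an ordered list of boolean tests on a feature dictionary and that the counts are emitted per rule in the order (nonmatching, matching), as in the example above. The two toy rules and all names are hypothetical; with the 77 learned rules the input vector would have 154 entries.

```python
def count_vector(example, rules):
    """Concatenate (nonmatching, matching) literal counts over all rules."""
    vector = []
    for literals in rules:
        matching = sum(1 for literal in literals if literal(example))
        vector.extend([len(literals) - matching, matching])
    return vector

def target_vector(class_index, n_classes):
    """One-hot training target: 1 for the correct class, 0 elsewhere."""
    return [1 if i == class_index else 0 for i in range(n_classes)]

# Two toy rules standing in for the 77 rules learned by Progol.
rules = [
    [lambda e: e["level"] == 3, lambda e: e["head_zone"] == 3],
    [lambda e: e["level"] == 3, lambda e: e["head_zone"] == 2,
     lambda e: e["head_primitive"] == 10],
]
example = {"level": 3, "head_zone": 3, "head_primitive": 1}

inputs = count_vector(example, rules)
target = target_vector(0, len(rules))
```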
[Figure: input layer (154 neurons), hidden layer, output layer (77 neurons)]

Fig. 7. The structure of the first BNN and the input & output vectors of character ‘fl’



4.2 Using Truth Values of Literals (2nd BNN) 

The disadvantage of the first BNN is that it takes each literal in a rule with equal 
significance. Thus it is unable to give important literals higher weights than others. 
We argue that literals in a rule have different levels of significance in classifying the 
characters. To capture this idea, instead of the numbers of nonmatching and matching 
literals, we design the structure of our second BNN by using the truth values of all 
literals for constructing the input vector. In the rules obtained in our experiment, no
literal introduces a new variable, which makes it easy to determine its truth value (true
or false).

In this type of BNN, the value of matching literal is set to 1 (true) whereas the 
value of nonmatching literal is set to 0 (false) for an input neuron. Each hidden 
neuron represents each rule in the rule set. All input neurons corresponding to literals 
in the same rule are linked to one hidden neuron that represents the rule. Therefore the 
number of the hidden neurons is equal to the number of all rules; in our experiment, 
the number is 77, since only one rule per class is obtained by Progol. Nevertheless, if 
some class is defined by more than one rule, the additional hidden neurons together 
with their corresponding input neurons must be included, and this can be directly 
handled by the proposed BNN without any change. 

Figure 8 shows the structure of our second BNN, and the input and output vectors 
for training the character ‘n’. When training the network with an example of character 
‘n’, the example is examined with all literals in all 77 rules. First, the truth values of 
four literals in the rule of the character ‘n’ in Figure 6 are evaluated to 1, 1, 1 and 1. 
Then the example is examined with the rule of the character ‘ni’, and the values of the
literals are 1, 1, 0, 1, 0, 0, 1, 0, 0 and 1. After the image is checked with all 77 rules,
these numbers are then fed into the neural network as the input vector for the 
character ‘fi’. In the training mode, the first neuron in the output layer that 
corresponds to the character ‘fi’ is set to 1. This training process is then repeated for 
the other examples of the character ‘n’ as well as those of the other characters. In the 
recognition mode, the neuron with the highest-valued output will be taken as the 
prediction. 
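The input-to-hidden part of the second BNN, where each hidden neuron is linked only to the literals of its own rule, can be sketched as follows. This is an assumption-laden illustration: the sigmoid activation of the hidden neurons is carried over from the first BNN, the weights and biases are placeholders rather than learned values, the hidden-to-output layer is omitted, and all function names are hypothetical.

```python
import math

def truth_vector(example, rules):
    """Concatenate the truth values (1.0 or 0.0) of all literals of all rules."""
    return [1.0 if literal(example) else 0.0
            for literals in rules for literal in literals]

def hidden_activations(truths, rule_sizes, weights, biases):
    """Each hidden neuron sees only the block of inputs belonging to its rule."""
    activations, start = [], 0
    for size, w, b in zip(rule_sizes, weights, biases):
        block = truths[start:start + size]
        z = sum(wi * xi for wi, xi in zip(w, block)) + b
        activations.append(1.0 / (1.0 + math.exp(-z)))
        start += size
    return activations

# Truth values quoted in the text for the first two rules (4 and 10 literals).
truths = [1, 1, 1, 1] + [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
rule_sizes = [4, 10]
weights = [[1.0] * 4, [1.0] * 10]  # placeholder weights, one per literal
biases = [0.0, 0.0]
hidden = hidden_activations(truths, rule_sizes, weights, biases)
```

Because every literal has its own weight, training can make a single decisive literal count for more than several weak ones, which the count-based input of the first BNN cannot express.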




[Figure: hidden layer (77 neurons), output layer (77 neurons)]

Fig. 8. The structure of the second BNN and the input & output vectors of character ‘fi’.



5 Experimental Results 

The experiment was run to test our methods. The training character images in our
experiment were printed by a laser printer at a resolution of 300 dpi and scanned
into the computer at the same resolution. These images consist of
77 different characters (‘n’, ‘ni’, ‘'ll’, ‘fi’,...,‘g’), 2 fonts (Cordia and Eucrosia), and 7
sizes (20, 22, 24, 28, 32, 36 and 48 points). The images are then passed through noise
reduction and thinning to find their structure. Next, the
important features are extracted to construct the training examples. These
examples are used to train FLS and our methods. For testing, noise was added to
each original image by photocopying it twice, producing a darker and a
lighter copy. The total number of test examples was 2 × 2 × 7 × 77 = 2156. The experimental
results on the test data are shown in Table 1. We do not report the accuracies of the
methods on the training set since they are almost the same and nearly 100%. We also
include in Table 1 the result of using BNN alone with the same training examples.



Table 1. Accuracies of FLS, Progol, BNN and Progol & BNN on noisy images. 



Font &     No.          FLS      BNN      Progol   Progol &   Progol &
Size       character                               1st BNN    2nd BNN

C20*       154          86.54    80.52    80.52    93.47      94.77
C22        154          88.90    84.21    89.61    95.42      98.04
C24        154          89.20    90.91    87.01    93.54      94.77
C28        154          92.89    79.22    88.31    94.81      96.10
C32        154          88.23    85.71    88.31    95.42      98.04
C36        154          91.58    86.84    89.61    97.37      98.68
C48        154          87.95    72.73    85.06    92.84      93.46
E20*       154          68.19    81.82    68.18    75.99      78.57
E22        154          86.89    82.89    88.31    92.86      94.77
E24        154          87.86    93.51    83.77    90.92      92.86
E28        154          86.89    88.31    83.77    91.56      95.45
E32        154          93.51    92.21    89.61    95.46      95.45
E36        154          92.20    92.11    88.31    95.45      96.05
E48        154          85.90    68.42    79.22    90.76      92.72
Average    2156         87.62    84.25    84.97    92.55      94.26

* C and E are fonts Cordia and Eucrosia, respectively. Accuracies are given in %.



The results reveal that the accuracy of rules produced by Progol is comparable to
BNN, and is lower than the accuracy of FLS. Since Progol can correctly recognize the
input data only when the data is perfectly matched with the correct rule, it is not able
to recognize input data that matches partly with the correct rule and partly
with others. This is the reason for the lower accuracy of Progol. On the other hand, FLS
always chooses the most similar stored pattern as the matching character, even if there
is no stored pattern that exactly matches the input data. After approximate matching was
performed by using the numbers of nonmatching and matching literals, the accuracy
improved to 92.55%. However, in the case of a character that is similar to others, the
first BNN may misclassify the character. For instance, in our experiment we found a
case in which the first BNN misclassified a character image ‘n’ as the character image ‘n’.
In this case, the second BNN, which uses the truth values of literals, correctly classified
this character image. As shown in Table 1, the second BNN yields a significant
improvement by achieving 94.26% accuracy.

6 Related Works and Limitations

There are some ILP systems that are able to learn multi-class concepts represented in
first-order rules, such as RTL [1], ICL [7], MULT_ICN [9]. Nevertheless, it is still
possible that some unseen data may not exactly match the rules produced by these
systems. We believe that our method for approximating rules can also be applied to
these systems. TILDE [2] is a first-order extension of the decision tree learning algorithm
and it always assigns some class to given unseen data. Therefore, TILDE can be
applied to our task as well. Further study is required to compare our method with
TILDE; this is part of our ongoing work. KBANN [16] and FONN [3] are systems
that employ initial rules and training examples for learning neural networks. These
systems showed better performance than methods which learn purely from
examples. FONN translates first-order rules, possibly including literals with new
variables, into a neural network. Its primary goal is to refine numerical literals of the
rules. It first finds only substitutions for new variables that satisfy non-numerical
literals and uses these substitutions for refining the numerical literals [3]. One
limitation of our method is that the network deals only with rules which introduce no
new variable. Though rules with no new variable are sufficient for our current
task, the method needs further study before it can be applied to other tasks where
rules introduce new variables. A method like that of FONN may be helpful. In our task,
we consider the approximation of ILP rules when there is no rule that
exactly matches unseen data. Though it did not occur in our current task, the related
problem of multiple rules firing will be studied in the near future.



7 Conclusions 

We have proposed an improved method that combines ILP and BNN for the task of 
Thai printed character recognition. The experimental results show that our method 
gives a significant improvement on previous methods. The improved results come 
from the combination of ILP and BNN. ILP produces rules that accurately classify the 
training data, and BNN makes the mle more flexible for approximately matching with 
unseen or noisy data. Moreover, the results also demonstrate that to approximate the 
rule, each predicate in the rule should be weighted unequally; this is accomplished by 
using the truth values of literals as the input vector and separately assigning a weight 
to each literal. 

Acknowledgements. We would like to thank Decha Rattanatarn for providing us

Abstract. Numerous measures are used for performance evaluation in
machine learning. In predictive knowledge discovery, the most frequently
used measure is classification accuracy. With new tasks being addressed
in knowledge discovery, new measures appear. In descriptive knowledge
discovery, where induced rules are not primarily intended for classification,
new measures used are novelty in clausal and subgroup discovery,
and support and confidence in association rule learning. Additional measures
are needed as many descriptive knowledge discovery tasks involve
the induction of a large set of redundant rules and the problem is the
ranking and filtering of the induced rule set. In this paper we develop
a unifying view on some of the existing measures for predictive and
descriptive induction. We provide a common terminology and notation by
means of contingency tables. We demonstrate how to trade off these
measures, by using what we call weighted relative accuracy. The paper
furthermore demonstrates that many rule evaluation measures developed
for predictive knowledge discovery can be adapted to descriptive
knowledge discovery tasks.



1 Introduction 

Numerous measures are used for performance evaluation in machine learning and 
knowledge discovery. In classification-oriented predictive induction, the most fre- 
quently used measure is classification accuracy. Other standard measures include 
precision and recall in information retrieval, and sensitivity and specificity in me- 
dical data analysis. With new tasks being addressed in knowledge discovery, new 
measures need to be defined, such as novelty in clausal and subgroup discovery, 
and support and confidence in association rule learning. These new knowledge 
discovery tasks belong to what is called descriptive induction. Descriptive induc- 
tion also includes other knowledge discovery tasks, such as learning of properties, 
integrity constraints, and attribute dependencies. 

This paper provides an analysis of selected rule evaluation measures. The
analysis applies to cases where single rules have to be ranked according to how
well they are supported by the data. It also applies to both predictive and
descriptive induction. As we argue in this paper, the right way to use standard rule
evaluation measures is relative to some threshold, e.g., relative to the trivial rule
‘all instances belong to this class’. We thus introduce relative versions of these
standard measures, e.g., relative accuracy. We then show that relative measures
provide a link with descriptive measures estimating novelty. Furthermore, by
taking a weighted variant of such relative measures we show that we in fact
obtain a trade-off between several of them by maximizing a single measure called
weighted relative accuracy.

The outline of the paper is as follows. In Section 2 we introduce the terminology
and notation used in this paper. In particular, we introduce the contingency
table notation that will be put to use in Section 3, where we formulate predictive
and descriptive measures found in the literature in this framework. Our main
results concerning unifications between different predictive measures, and
between predictive and descriptive measures, are presented in Section 4. In Section 5
we support our theoretical analysis with some preliminary empirical evidence.
Finally, in Section 6 we discuss the main contributions of this work.

2 Terminology and Notation 

In this section we introduce the terminology and notation used throughout the
paper. Since we are not restricted to predictive induction, the rules we consider
have a more general format than the format of prediction rules that have a
single classification literal in the conclusion of a rule. Below we only assume that
induced rules are implications with a head and a body (Section 2.1). Due to
this general rule form, the notions of positive and negative example have to be
generalized: predicted positives/negatives are those instances for which the body
is true/false, and actual positives/negatives are instances for which the head is
true/false. In this framework, a contingency table, as explained in Section 2.2,
is used as the basis for computing rule evaluation measures.



2.1 Rules 

We restrict attention to learning systems that induce rules of the form 

Head ← Body

Predictive induction deals with learning of rules aimed at prediction and/or
classification tasks. The inputs to predictive learners are classified examples, and
the outputs are prediction or classification rules. These rules can be induced by
propositional or by first-order learners. In propositional predictive rules, Body is
(typically) a conjunction of attribute-value pairs, and Head is a class assignment.
In first-order learning, frequently referred to as inductive logic programming,
predictive rules are Prolog clauses, where Head is a single positive literal and Body
is a conjunction of positive and/or negative literals. The important difference
with propositional predictive rules is that first-order rules contain variables that
are shared between literals and between Head and Body.

Descriptive induction deals with learning of rules aimed at knowledge discovery
tasks other than classification tasks. Those include learning of properties,
integrity constraints, functional dependencies, as well as the discovery of
interesting subgroups, association rule learning, etc. The inputs to descriptive learners
are unclassified instances, i.e., descriptive induction is unsupervised. In comparison
with propositional prediction rules, in which Head is a class assignment,
association rules allow the Head to be a conjunction of attribute tests.
Propositional association rules have recently been upgraded to the first-order case
[?]. Descriptive first-order rules also include general clauses, which allow for a
disjunction of literals to be used in the Head.

In the abstract framework of this paper, rules are binary objects consisting of
Head and Body. Rule evaluation measures are intended to give an indication of
the strength of the (hypothetical) association between Body and Head expressed
by such a rule. We assume a certain unspecified language bias that determines
all possible heads and bodies of rules. We also assume a given set of instances,
i.e., classified or unclassified examples, and we assume a given procedure by
which we can determine, for every possible Head and Body and every instance,
whether or not it is true for that instance. We say that an instance is covered by a
rule Head ← Body if Body is true for the instance. In the propositional case, an
instance is covered when it satisfies the conditions of a rule (all the conditions of a
rule are evaluated true given the instance description). In the first-order case, the
atom(s) describing the instance are matched with the rule head, thus determining
a substitution θ by which the variables in the rule head are replaced by the terms
(constants) in the instance description. The rule covers the instance iff Bodyθ is
evaluated as true.
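
The propositional notion of coverage can be sketched in a few lines of Python. This is
only an illustration: representing an instance as a dictionary of attribute values and
a rule body as a dictionary of required attribute-value pairs, as well as the helper
name `covers` and the two example instances, are assumptions made here and are not part
of the framework itself.

```python
def covers(body, instance):
    """Return True iff every attribute-value condition of the body holds for the instance."""
    return all(instance.get(attr) == value for attr, value in body.items())


# Hypothetical instances, described by attribute-value pairs.
instances = [
    {"safety": "low", "persons": "2", "car": "unacc"},
    {"safety": "high", "persons": "4", "car": "acc"},
]
body = {"safety": "low"}  # body of a rule such as car=unacc <- safety=low

covered = [covers(body, inst) for inst in instances]
print(covered)
```

A body with several conditions is handled the same way, since `all` requires every
condition to be satisfied; an empty body covers every instance.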

2.2 Contingency Table 

Given the above concepts, we can construct a contingency table for an arbitrary
rule $H \leftarrow B$. In Table 1, $B$ denotes the set of instances for which the body of
the rule is true, and $\overline{B}$ denotes its complement (the set of instances for which
the body is false); similarly for $H$ and $\overline{H}$. $HB$ then denotes $H \cap B$,
$H\overline{B}$ denotes $H \cap \overline{B}$, and so on.



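Table 1. A contingency table.

\[
\begin{array}{c|cc|c}
             & B                & \overline{B}                  &                 \\ \hline
H            & n(HB)            & n(H\overline{B})              & n(H)            \\
\overline{H} & n(\overline{H}B) & n(\overline{H}\,\overline{B}) & n(\overline{H}) \\ \hline
             & n(B)             & n(\overline{B})               & N
\end{array}
\]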



We use $n(X)$ to denote the cardinality of set $X$, e.g., $n(\overline{H}B)$ is the number
of instances for which $H$ is false and $B$ is true (i.e., the number of instances
erroneously covered by the rule). $N$ denotes the total number of instances in
the sample. The relative frequency $n(X)/N$ associated with $X$ is denoted by $p(X)$.¹
All rule evaluation measures considered in this paper are defined in terms of
frequencies from the contingency table only.

Notice that a contingency table is a generalisation of a confusion matrix,
which is the standard basis for computing rule evaluation measures in binary
classification problems. In the confusion matrix notation, $n(H) = P_a$ is the
number of positive examples, $n(\overline{H}) = N_a$ the number of negative examples,
$n(B) = P_p$ the number of examples covered by the rule and therefore predicted as
positive, $n(\overline{B}) = N_p$ the number of examples not covered by the rule and
therefore predicted as negative, $n(HB) = TP$ the number of true positives,
$n(\overline{H}\,\overline{B}) = TN$ the number of true negatives, $n(\overline{H}B) = FP$ the
number of false positives, and $n(H\overline{B}) = FN$ the number of false negatives.
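
The correspondence between the contingency table and the confusion matrix can be made
concrete with a small Python sketch. It assumes that each instance has already been
evaluated into a pair of booleans (head true?, body true?); the function name
`contingency_counts` and the sample data are hypothetical and serve only as an
illustration.

```python
from collections import Counter


def contingency_counts(instances):
    """Count the four cells of the contingency table.

    `instances` is an iterable of (head_true, body_true) pairs of booleans.
    Returns a dict keyed by the confusion-matrix names TP, FP, FN, TN and N.
    """
    cells = Counter((bool(h), bool(b)) for h, b in instances)
    tp = cells[(True, True)]    # n(HB): head and body both true
    fn = cells[(True, False)]   # head true, body false
    fp = cells[(False, True)]   # head false, body true (erroneously covered)
    tn = cells[(False, False)]  # head and body both false
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "N": tp + fp + fn + tn}


# Made-up sample: 3 true positives, 1 false positive, 2 false negatives, 4 true negatives.
sample = ([(True, True)] * 3 + [(False, True)] * 1
          + [(True, False)] * 2 + [(False, False)] * 4)
table = contingency_counts(sample)
print(table)
```

The marginals follow by addition, e.g. the number of actual positives is `TP + FN` and
the number of predicted positives is `TP + FP`.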



3 Selected Rule Evaluation Measures 



In this section, selected rule evaluation measures are formulated in the
contingency table terminology, which is the first step towards the unifying view
developed in Section 4. The definitions are given in terms of relative frequencies
derived from the contingency table. Since our framework is not restricted to
predictive induction, we also elaborate some novelty-based measures found in the
knowledge discovery literature; see [?], which also discusses other measures and
the axioms that rule evaluation measures should satisfy. The usefulness of our
unifying framework is demonstrated in Section 4, where we point out the many
relations that exist between weighted and relative variants of these measures.

Definition 1 (Rule accuracy). $Acc(H \leftarrow B) = p(H|B)$.

Definition 2 (Negative reliability). $NegRel(H \leftarrow B) = p(\overline{H}|\overline{B})$.

Definition 3 (Sensitivity). $Sens(H \leftarrow B) = p(B|H)$.

Definition 4 (Specificity). $Spec(H \leftarrow B) = p(\overline{B}|\overline{H})$.

Accuracy of rule $R = H \leftarrow B$, here defined as the conditional probability
that $H$ is true given that $B$ is true, measures the fraction of predicted positives
that are true positives in the case of binary classification problems:

\[
Acc(R) = \frac{TP}{TP + FP}
       = \frac{n(HB)}{n(HB) + n(\overline{H}B)}
       = \frac{n(HB)}{n(B)}
       = \frac{n(HB)/N}{n(B)/N}
       = \frac{p(HB)}{p(B)}
       = p(H|B).
\]



Rule accuracy is also called precision in information retrieval. Furthermore,
accuracy error $Err(H \leftarrow B) = 1 - Acc(H \leftarrow B) = p(\overline{H}|B)$.

Our definition of rule accuracy is intended for evaluating single rules, and is
therefore biased towards the accuracy of positive examples. As such, it is
different from what we call rule set accuracy, defined as
$Acc = \frac{TP + TN}{N} = p(HB) + p(\overline{H}\,\overline{B})$, which is standardly used
for evaluation of hypotheses comprised of several rules.

¹ In this paper we are not really concerned with probability estimation, and we
interpret the sample relative frequency as a probability.

Given our general knowledge discovery framework, it can now also be seen
that rule accuracy is in fact the same as confidence in association rule learning.
Rule accuracy can also be used to measure the reliability of the rule in the
prediction of positive cases, since it measures the correctness of returned results.
The reliability of negative predictions is in binary classification problems
computed as follows:

\[
NegRel(R) = \frac{TN}{TN + FN} = \frac{TN}{N_p}
          = \frac{n(\overline{H}\,\overline{B})}{n(\overline{B})}
          = p(\overline{H}|\overline{B}).
\]

Sensitivity is identical to recall (of positive cases) used in information
retrieval. Sensitivity, here defined as the conditional probability that $B$ is true given
that $H$ is true, measures the fraction of actual positives that are correctly
classified in the case of binary classification problems:

\[
Sens(R) = \frac{TP}{TP + FN} = \frac{TP}{P_a}
        = \frac{n(HB)}{n(HB) + n(H\overline{B})}
        = \frac{n(HB)}{n(H)}
        = p(B|H).
\]

Sensitivity can also be interpreted as the accuracy of the rule $B \leftarrow H$, which
in logic programming terms is the completion of the rule $H \leftarrow B$.

Specificity is the conditional probability that $B$ is false given that $H$ is false.
In binary classification problems, it is equal to the recall of negative cases in
information retrieval:

\[
Spec(R) = \frac{TN}{TN + FP} = \frac{TN}{N_a}
        = \frac{n(\overline{H}\,\overline{B})}{n(\overline{H})}
        = p(\overline{B}|\overline{H}).
\]
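
The four measures of Definitions 1 to 4 can be computed directly from the cells of the
confusion matrix. The following Python sketch is only an illustration; the function
name `rule_measures` and the example counts are made up, and the sketch assumes that
none of the four marginals is empty (otherwise a division by zero occurs).

```python
def rule_measures(tp, fp, fn, tn):
    """Return Acc, NegRel, Sens and Spec of a rule from its contingency counts."""
    return {
        "Acc": tp / (tp + fp),     # p(H|B): precision, confidence
        "NegRel": tn / (tn + fn),  # p(not H | not B): reliability of negative predictions
        "Sens": tp / (tp + fn),    # p(B|H): recall of positive cases
        "Spec": tn / (tn + fp),    # p(not B | not H): recall of negative cases
    }


# Hypothetical rule: 30 true positives, 10 false positives,
# 20 false negatives and 40 true negatives.
m = rule_measures(tp=30, fp=10, fn=20, tn=40)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Swapping the roles of head and body (exchanging `fp` and `fn`) turns sensitivity into
accuracy, which mirrors the remark that sensitivity is the accuracy of the rule with
head and body interchanged.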

We now introduce other measures that are used to develop our unifying view 
in the next section. 

Definition 5 (Coverage). $Cov(H \leftarrow B) = p(B)$.

Definition 6 (Support). $Sup(H \leftarrow B) = p(HB)$.

Coverage measures the fraction of instances covered by the body of a rule. As 
such it is a measure of generality of a rule. Support of a rule is a related measure 
known from association rule learning, also called frequency. Notice that, unlike 
the previous measures, support is symmetric in H and B. 

The next measure aims at assessing the novelty, interestingness or unusualness
of a rule. Novelty measures are used, e.g., in the MIDOS system for subgroup
discovery [?], and in the PRIMUS family of systems for clausal discovery [?]. Here
we follow the elaboration of the PRIMUS novelty measure, because it is
formulated in the more general setting of clausal discovery, and because it is clearly
linked with the contingency table framework.

Consider again the contingency table in Table 1. We define a rule $H \leftarrow B$
to be novel if $n(HB)$ cannot be inferred from the marginal frequencies $n(H)$
and $n(B)$; in other words, if $H$ and $B$ are not statistically independent. We
thus compare the observed $n(HB)$ with the expected value under independence
$\mu(HB) = \frac{n(H)\,n(B)}{N}$. The more the observed value $n(HB)$ differs from the
expected value $\mu(HB)$, the more likely it is that there exists a real and unexpected
association between $H$ and $B$, expressed by the rule $H \leftarrow B$. Novelty is thus
defined as the relative difference between $n(HB)$ and $\mu(HB)$, i.e., the
difference divided by $N$.

Definition 7 (Novelty). $Nov(H \leftarrow B) = p(HB) - p(H)p(B)$.

Notice that $p(HB)$ is what is called support in association rule learning. The
definition of novelty states that we are only interested in high support if that
couldn’t be expected from the marginal probabilities, i.e., when $p(H)$ and/or
$p(B)$ are relatively low. It can be demonstrated that $-0.25 \le Nov(R) \le 0.25$: a
strongly positive value indicates a strong association between $H$ and $B$, while a
strongly negative value indicates a strong association between $\overline{H}$ and $B$.²
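
The range of the novelty measure can be illustrated numerically. The Python sketch
below computes novelty exactly from counts and enumerates all contingency tables for a
small sample; the function name `novelty`, the sample size of 12, and the three example
tables are assumptions chosen for illustration only.

```python
from fractions import Fraction


def novelty(n_hb, n_h, n_b, n):
    """Nov(H <- B) = p(HB) - p(H)p(B), computed exactly from counts."""
    return Fraction(n_hb, n) - Fraction(n_h, n) * Fraction(n_b, n)


# Three hypothetical tables on a sample of 100 instances.
perfect = novelty(n_hb=50, n_h=50, n_b=50, n=100)      # B holds exactly when H holds
opposite = novelty(n_hb=0, n_h=50, n_b=50, n=100)      # B holds exactly when H fails
independent = novelty(n_hb=25, n_h=50, n_b=50, n=100)  # H and B independent
print(perfect, opposite, independent)

# Enumerate every consistent table for N = 12 and record the extreme values.
n = 12
values = [novelty(hb, h, b, n)
          for h in range(n + 1) for b in range(n + 1)
          for hb in range(max(0, h + b - n), min(h, b) + 1)]
print(min(values), max(values))
```

The extremes are reached when half of the instances satisfy the head and the body
either coincides with the head or with its complement.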

In the MIDOS subgroup discovery system this measure is used to detect
unusual subgroups. For selected head $H$, indicating a property we are interested
in, body $B$ defines an unusual subgroup of the instances satisfying $H$ if the
distribution of $H$-instances among $B$-instances is sufficiently different from the
distribution of $H$-instances in the sample. In situations like this, where $H$ is
selected, this definition of novelty is sufficient. However, notice that
$Nov(H \leftarrow B)$ is symmetric in $H$ and $B$, which means that $H \leftarrow B$ and
$B \leftarrow H$ will always carry the same novelty, even though one of them may have many
more counter-instances (satisfying the body but falsifying the head) than the other.

To distinguish between such cases, PRIMUS additionally employs the measure
of satisfaction, which is the relative decrease in accuracy error between the
rule $H \leftarrow true$ and the rule $H \leftarrow B$. It is a variant of rule accuracy which
takes the whole of the contingency table into account; it is thus more suited towards
knowledge discovery, being able to trade off rules with different heads as well as
bodies.

Definition 8 (Satisfaction). $Sat(H \leftarrow B) = \frac{p(\overline{H}) - p(\overline{H}|B)}{p(\overline{H})}$.

It can be shown that $Sat(H \leftarrow B) = \frac{p(H|B) - p(H)}{1 - p(H)}$, since
$p(\overline{H}) - p(\overline{H}|B) = (1 - p(H)) - (1 - p(H|B)) = p(H|B) - p(H)$. We thus see
that $Sat(H \leftarrow B)$ is similar to rule accuracy $p(H|B)$, e.g., $Sat(R) = 1$ iff
$Acc(R) = 1$. However, unlike rule accuracy, satisfaction takes the whole of the
contingency table into account and is thus more suited towards knowledge discovery,
trading off rules with different heads as well as bodies.

Finally, we mention that PRIMUS trades off novelty and satisfaction by
multiplying them, resulting in a $\chi^2$-like statistic:

\[
Nov(H \leftarrow B) \times Sat(H \leftarrow B)
  = \frac{\bigl(p(\overline{H})p(B) - p(\overline{H}B)\bigr)^2}{p(\overline{H})\,p(B)}
\]

This is one term in the $\chi^2$ sum for the contingency table, corresponding to the
lower left-hand cell (the counter-instances). We omit the details of the
normalization.
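
The product of novelty and satisfaction can be written out step by step. The derivation
below is a sketch that uses only Definitions 7 and 8 together with the identities
$p(HB) = p(B) - p(\overline{H}B)$ and $p(H) = 1 - p(\overline{H})$; it is not taken from the
PRIMUS literature.

```latex
\begin{align*}
Nov(H \leftarrow B)
  &= p(HB) - p(H)\,p(B) \\
  &= p(B) - p(\overline{H}B) - \bigl(1 - p(\overline{H})\bigr)\,p(B)
   = p(\overline{H})\,p(B) - p(\overline{H}B), \\[4pt]
Sat(H \leftarrow B)
  &= \frac{p(\overline{H}) - p(\overline{H}|B)}{p(\overline{H})}
   = \frac{p(\overline{H})\,p(B) - p(\overline{H}B)}{p(\overline{H})\,p(B)}, \\[4pt]
Nov(H \leftarrow B) \cdot Sat(H \leftarrow B)
  &= \frac{\bigl(p(\overline{H})\,p(B) - p(\overline{H}B)\bigr)^{2}}{p(\overline{H})\,p(B)}.
\end{align*}
```

The result has the form (expected minus observed) squared over expected for the cell of
counter-instances, where the expected relative frequency under independence is
$p(\overline{H})\,p(B)$.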



² Since negative novelty can be transformed into positive novelty associated with the
rule $\overline{H} \leftarrow B$, systems like MIDOS and PRIMUS set $Nov(H \leftarrow B) = 0$ if
$p(HB) < p(H)p(B)$. The more general expression of Definition 7 is kept because it
allows a more straightforward statement of our main results.

³ Again, in practice we put $Sat(R) = 0$ if $p(H) > p(H|B)$.

4 A Unifying View

In the previous section we formulated selected rule evaluation measures in our
more general knowledge discovery framework. In this section we show the
usefulness of this framework by establishing a synthesis between these measures.
The main inspiration for this synthesis comes from the novelty measure, which
is relative in the sense that it compares the support of the rule with the expected
support under the assumption of statistical independence (Definition 7).

Definition 9 (Relative accuracy). $RAcc(H \leftarrow B) = p(H|B) - p(H)$.

Relative accuracy of a rule $R = H \leftarrow B$ is the accuracy gain relative to the fixed
rule $H \leftarrow true$. The latter rule predicts all instances to satisfy $H$; a rule is only
interesting if it improves upon this ‘default’ accuracy. Another way of viewing
relative accuracy is that it measures the utility of connecting body $B$ with a
given head $H$.

Similarly, we define relative versions of other rule evaluation measures. 

Definition 10 (Relative negative reliability).
$RNegRel(H \leftarrow B) = p(\overline{H}|\overline{B}) - p(\overline{H})$.

Definition 11 (Relative sensitivity). $RSens(H \leftarrow B) = p(B|H) - p(B)$.

Definition 12 (Relative specificity). $RSpec(H \leftarrow B) = p(\overline{B}|\overline{H}) - p(\overline{B})$.

Like relative accuracy, relative negative reliability measures the utility of
connecting body $B$ with a given head $H$. The latter two measures can be interpreted
as sensitivity/specificity gain relative to the rule $true \leftarrow B$, i.e., the utility of
connecting a given body $B$ with head $H$. Notice that this view is taken in rule
construction by the CN2 algorithm [?], which first builds a rule body and
subsequently assigns an appropriate rule head.

To repeat, the point about relative measures is that they give more infor- 
mation about the utility of a rule than absolute measures. For instance, if in a 
prediction task the accuracy of a rule is lower than the relative frequency of the 
class it predicts, then the rule actually performs badly, regardless of its absolute 
accuracy. 

There is however a problem with relative accuracy as such: it is easy to obtain 
high relative accuracy with highly specific rules, i.e., rules with low generality 
p{B). To this end, a weighted variant is introduced, which is the key notion in 
this paper. 

Definition 13 (Weighted relative accuracy).
$WRAcc(H \leftarrow B) = p(B)\,(p(H|B) - p(H))$.

Weighted relative accuracy trades off generality and relative accuracy. It is known
in the literature as a gain measure, used to evaluate the utility of a literal $L$
considered for extending the body $B$ of a rule:
$\frac{p(BL)}{p(B)}\,(p(H|BL) - p(H|B))$.

We now come to a result which, although technically trivial, provides
a significant contribution to our understanding of rule evaluation measures.

Theorem 1. $WRAcc(R) = Nov(R)$.

Proof. $WRAcc(H \leftarrow B) = p(B)\,(p(H|B) - p(H)) = p(B)p(H|B) - p(H)p(B) =
p(HB) - p(H)p(B) = Nov(H \leftarrow B)$. □

Theorem 1 has the following implications.

1. Rules with high weighted relative accuracy also have high novelty, and vice
versa.

2. High novelty is achieved by trading off generality and rule accuracy gained
in comparison with a trivial rule $H \leftarrow true$. This also means that having
high relative accuracy is not enough for considering a rule to be interesting,
since the rule needs to be general enough as well.

This link between predictive and descriptive rule evaluation measures has, to
the best of our knowledge, not been published before.
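
Both implications can be illustrated with a small numerical sketch in Python. The two
rules and all counts below are invented for illustration, and the function names are
not taken from any existing system; exact rational arithmetic is used so that the
equality of Theorem 1 can be checked without rounding error.

```python
from fractions import Fraction


def racc(n_hb, n_h, n_b, n):
    """Relative accuracy p(H|B) - p(H)."""
    return Fraction(n_hb, n_b) - Fraction(n_h, n)


def wracc(n_hb, n_h, n_b, n):
    """Weighted relative accuracy p(B) * (p(H|B) - p(H))."""
    return Fraction(n_b, n) * racc(n_hb, n_h, n_b, n)


def nov(n_hb, n_h, n_b, n):
    """Novelty p(HB) - p(H)p(B)."""
    return Fraction(n_hb, n) - Fraction(n_h, n) * Fraction(n_b, n)


N, N_H = 1000, 300                  # hypothetical sample: 300 instances satisfy H
specific = dict(n_hb=2, n_b=2)      # covers 2 instances, both satisfy H
general = dict(n_hb=240, n_b=400)   # covers 400 instances, 240 of them satisfy H

for name, r in [("specific", specific), ("general", general)]:
    ra = racc(r["n_hb"], N_H, r["n_b"], N)
    wra = wracc(r["n_hb"], N_H, r["n_b"], N)
    assert wra == nov(r["n_hb"], N_H, r["n_b"], N)  # Theorem 1
    print(name, float(ra), float(wra))
```

In this example the very specific rule has the higher relative accuracy, but the
general rule has by far the higher weighted relative accuracy, and hence the higher
novelty.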

We proceed to show that weighted relative accuracy is one of the most fun- 
damental rule evaluation measures, by showing that it also provides a trade-off 
between accuracy and other predictive measures such as sensitivity. To do so, 
we first define weighted versions of the other relative measures defined above. 

Definition 14 (Weighted relative negative reliability).
$WRNegRel(H \leftarrow B) = p(\overline{B})\,(p(\overline{H}|\overline{B}) - p(\overline{H}))$.

The weight $p(\overline{B})$ is motivated by the fact that overly general rules trivially have
a high negative reliability.

Definition 15 (Weighted relative sensitivity).
$WRSens(H \leftarrow B) = p(H)\,(p(B|H) - p(B))$.

Definition 16 (Weighted relative specificity).
$WRSpec(H \leftarrow B) = p(\overline{H})\,(p(\overline{B}|\overline{H}) - p(\overline{B}))$.

Again, the weights guard against trivial solutions.

This leads us to establishing a trade-off between the four standard predic- 
tive rule evaluation measures, by relating them through their weighted relative 
variants. 

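Theorem 2. $WRAcc(R) = WRSens(R) = WRSpec(R) = WRNegRel(R)$.

Proof. $WRAcc(H \leftarrow B) = p(B)\,(p(H|B) - p(H)) = p(HB) - p(H)p(B) =
p(H)\,(p(B|H) - p(B)) = WRSens(H \leftarrow B)$.

$WRAcc(H \leftarrow B) = p(B)\,(p(H|B) - p(H)) = p(HB) - p(H)p(B) =
(1 - p(\overline{H}B) - p(H\overline{B}) - p(\overline{H}\,\overline{B})) - (1 - p(\overline{H}))(1 - p(\overline{B})) =
(1 - p(\overline{H}) - p(\overline{B}) + p(\overline{H}\,\overline{B})) - (1 - p(\overline{H}) - p(\overline{B}) + p(\overline{H})p(\overline{B})) =
p(\overline{H}\,\overline{B}) - p(\overline{H})p(\overline{B}) = p(\overline{H})\,(p(\overline{B}|\overline{H}) - p(\overline{B})) =
WRSpec(H \leftarrow B)$.

$WRSpec(H \leftarrow B) = p(\overline{H})\,(p(\overline{B}|\overline{H}) - p(\overline{B})) =
p(\overline{H}\,\overline{B}) - p(\overline{H})p(\overline{B}) = p(\overline{B})\,(p(\overline{H}|\overline{B}) - p(\overline{H})) =
WRNegRel(H \leftarrow B)$. □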

We have thus established a complete synthesis between different predictive
rule evaluation measures, and between these measures and the descriptive notion
of novelty, by demonstrating that there is a single way in which all these measures
can be combined and thus traded off in a principled way.
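
The equalities of Theorem 2 can also be checked mechanically. The Python sketch below
evaluates the four weighted relative measures with exact fractions on every contingency
table whose cells lie between 1 and 5; the function name and the range of cell counts
are arbitrary choices made for this illustration, and a check of this kind supplements
the proof rather than replacing it.

```python
from fractions import Fraction
from itertools import product


def weighted_relative_measures(tp, fp, fn, tn):
    """Return (WRAcc, WRSens, WRSpec, WRNegRel) as exact fractions."""
    n = tp + fp + fn + tn
    p_h, p_b = Fraction(tp + fn, n), Fraction(tp + fp, n)
    p_not_h, p_not_b = 1 - p_h, 1 - p_b
    wracc = p_b * (Fraction(tp, tp + fp) - p_h)
    wrsens = p_h * (Fraction(tp, tp + fn) - p_b)
    wrspec = p_not_h * (Fraction(tn, tn + fp) - p_not_b)
    wrnegrel = p_not_b * (Fraction(tn, tn + fn) - p_not_h)
    return wracc, wrsens, wrspec, wrnegrel


# Cells start at 1 so that no marginal is empty and no division by zero occurs.
all_equal = all(
    len(set(weighted_relative_measures(*cells))) == 1
    for cells in product(range(1, 6), repeat=4)
)
print(all_equal)
```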



5 Rule Evaluation Measures in Practice 

In the previous section we have shown that a single measure, weighted relative 
accuracy, can be used to trade off different evaluation measures such as accuracy, 
sensitivity, and novelty. In this section we further support this claim with some 
preliminary empirical evidence. First, we describe an experiment in which weigh- 
ted relative accuracy correlates better with an expert’s intuitive understanding 
of “reliability” and “interestingness” than standard rule evaluation measures. Se- 
condly, we show the utility of weighted relative accuracy as a filtering measure 
in database dependency discovery. 

5.1 An Experiment 

The purpose of this experiment was to find out whether rule evaluation measures
as discussed in this paper really measure what they are supposed to measure. To
this end we compared an expert’s ranking of a number of rules on two dimensions
with the rankings given by four selected measures. We have used a CAR data
set (see UCI Machine Learning Repository [?]), which includes 1728 instances
that are described with six attributes and a corresponding four-valued class.
The attributes are multi-valued and include buying price, price of maintenance,
number of doors, capacity in terms of persons to carry, and estimated safety of
the car.

An ML* Machine Learning environment was used to generate association
rules from the CAR dataset. The designer of the experiment semi-randomly
chose ten rules that he thought might be of different quality with respect to the
measures introduced in this text. Note, however, that none of the rules was
explicitly measured at this stage.

The rules were then shown to the domain expert, who was asked to rank
them according to their “reliability” and “interestingness”. We chose these
non-technical terms to avoid possible interference with any technical interpretation;
neither term was in any way explained to the expert.⁴ The domain expert first
assigned qualitative grades to each rule (-, o, ⊕, +), and then chose a final rank
from these grades. The results of the ranking are shown in Table 2. Note that
some rules are ranked equally (e.g., the first two rules for reliability), and in such
cases a rank is represented as an interval. The correlations between the expert’s
rankings and the ranks obtained from the rule evaluation measures are given in
Table 3.

⁴ During the experiment, the expert expressed some of his intuitions regarding these
terms: “reliability measures how reliable the rule is when applied for a classification”;
“an interesting rule is the one that I never thought of when building a classification
model, e.g., those without the class (car) in the head”;
“an interesting rule has to tell me something new, but needs to be reliable as well
(it would help me if I would somehow know the reliability first before ranking on
interestingness)”;
“a highly reliable rule which is at the same time unusual is interesting”;
“a rule is interesting if it tells me something new, but it’s not an outlier”.



Table 2. Ten rules ranked by a domain expert on reliability (Rel) and interestingness
(Int), and corresponding rule evaluation measures.

                                            Expert              Rule evaluation measures
Rule                                   Rel  #     Int  #       Acc    Sens   Spec   WRAcc
buying=med car=good → maint=low        -    7-10  o    6       1.000  0.053  1.000   0.010
buying=low car=v-good → lugboot=big    -    7-10  -    7-10    0.615  0.042  0.987   0.006
safety=low → car=unacc                 +    1     -    7-10    1.000  0.476  1.000   0.100
persons=2 car=unacc → lugboot=big      -    7-10  -    7-10    0.333  0.333  0.667   0.000
lugboot=big car=good → safety=med      o    5-6   o    5       1.000  0.042  1.000   0.009
car=v-good → lugboot=big               ⊕    3     +    2       0.615  0.069  0.978   0.011
car=unacc → buying=v-high              ⊕    4     +    3       0.298  0.833  0.344   0.033
car=v-good → safety=high               +    2     +    1       1.000  0.113  1.000   0.025
persons=4 → lugboot=big car=unacc      -    7-10  -    7-10    0.153  0.239  0.641  -0.020
persons=4 safety=high → car=acc        o    5-6   o    4       0.563  0.281  0.938   0.038



Although the correlations in Table 3 are quite low, the tentative conclusion
is that WRAcc correlates best with both intuitive notions of reliability and
interestingness. This provides some preliminary empirical support for the idea
that WRAcc provides the right trade-off between predictive and descriptive rule
evaluation measures.



Table 3. Rank correlations between two measures elicited from the expert and four
rule evaluation measures.

                Acc     Sens    Spec    WRAcc
expert’s Rel    0.150   0.152   0.116   0.323
expert’s Int    0.067  -0.006   0.029   0.177
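
As a worked check of one row of Table 2, the measures for the rule
safety=low → car=unacc can be recomputed from counts. Only the sample size of 1728 is
stated in the text; the counts 576 (instances with safety=low) and 1210 (instances with
car=unacc) are assumptions taken from the commonly reported distribution of the UCI Car
Evaluation data, and the count of 576 instances satisfying both is inferred from the
reported accuracy of 1.000.

```python
N = 1728      # number of instances, as stated in the text
n_b = 576     # assumed: instances with safety=low (body)
n_h = 1210    # assumed: instances with car=unacc (head)
n_hb = 576    # assumed: instances with safety=low and car=unacc

acc = n_hb / n_b                                 # p(H|B)
sens = n_hb / n_h                                # p(B|H)
spec = ((N - n_h) - (n_b - n_hb)) / (N - n_h)    # p(not B | not H)
wracc = (n_b / N) * (acc - n_h / N)              # p(B) * (p(H|B) - p(H))

print(f"{acc:.3f} {sens:.3f} {spec:.3f} {wracc:.3f}")
```

Under these assumed counts the four values agree with the third row of Table 2 to three
decimal places.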



5.2 Rule Filtering 

The measures discussed in this paper are primarily intended for ranking and
filtering rules output by an induction algorithm. This is particularly important
in descriptive induction tasks such as association rule learning and database
dependency discovery, since descriptive induction algorithms typically output
several thousands of rules. We briefly describe some preliminary experience with
rule filtering using the functional dependency discovery tool fdep [?].

We ran fdep on some of the UCI datasets [?], and then used WRAcc to rank
the induced functional dependencies. Below we give some of the highest ranked
rules in several domains. They have the form $A_1, \ldots, A_n \rightarrow A$, meaning ‘given
the values of attributes $A_1, \ldots, A_n$, the value of attribute $A$ is fixed’; see [?] for
details of the transformation into $H \leftarrow B$ form.

Lymphography:

[block_lymph_c, regeneration, lym_nodes_enlar, no_nodes] -> [block_lymph_s]
[lymphatics, by_pass, regeneration, lym_nodes_enlar] -> [lym_nodes_dimin]

Primary tumor:

[class, histologic_type, degree_of_diffe, brain, skin, neck] -> [axillar]
[class, histologic_type, degree_of_diffe, bone_marrow, skin, neck] -> [axillar]
[class, histologic_type, degree_of_diffe, bone, bone_marrow, skin] -> [axillar]

Hepatitis:

[liver_firm, spleen_palpable, spiders, ascites, bilirubin] -> [class]
[liver_big, liver_firm, spiders, ascites, varices, bilirubin] -> [class]
[anorexia, liver_firm, spiders, ascites, varices, bilirubin] -> [class]

Wisconsin breast cancer:

[uni_cell_size, se_cell_size, bare_nuclei, normal_nucleoli, mitoses] -> [class]
[uni_cell_shape, marginal_adhesion, bare_nuclei, normal_nucleoli] -> [class]
[uni_cell_size, marginal_adhesion, se_cell_size, bare_nuclei, normal_nucleoli] -> [class]

Our experience with rule filtering in these domains suggested that $WRAcc(R)$
would drop quite sharply after the first few rules. Notice that in the last two
domains the induced functional dependencies determine the class attribute.
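
Ranking and filtering by a single score is straightforward to sketch in Python. Since
the WRAcc values of the functional dependencies above are not reported, the sketch
instead uses the association rules and WRAcc values of Table 2; the function name
`filter_rules` and the threshold of 0.02 are hypothetical choices made for
illustration.

```python
# (rule, WRAcc) pairs as reported in Table 2.
rules = [
    ("buying=med car=good -> maint=low", 0.010),
    ("buying=low car=v-good -> lugboot=big", 0.006),
    ("safety=low -> car=unacc", 0.100),
    ("persons=2 car=unacc -> lugboot=big", 0.000),
    ("lugboot=big car=good -> safety=med", 0.009),
    ("car=v-good -> lugboot=big", 0.011),
    ("car=unacc -> buying=v-high", 0.033),
    ("car=v-good -> safety=high", 0.025),
    ("persons=4 -> lugboot=big car=unacc", -0.020),
    ("persons=4 safety=high -> car=acc", 0.038),
]


def filter_rules(scored_rules, threshold):
    """Keep the rules whose score exceeds the threshold, best first."""
    kept = [(rule, score) for rule, score in scored_rules if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


top = filter_rules(rules, threshold=0.02)
for rule, score in top:
    print(f"{score:.3f}  {rule}")
```

In practice the threshold would be chosen by inspecting where the ranked scores drop
off, in line with the observation that WRAcc falls sharply after the first few rules.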

6 Summary and Discussion 

In this paper we have provided an analysis of selected rule evaluation measures
used in machine learning and knowledge discovery. We have argued that,
generally speaking, these measures should be used relative to some threshold,
e.g., relative to the situation where this particular rule head is not connected to
this particular rule body. Furthermore, we have proposed a single measure that
can be interpreted in at least 5 different ways: as weighted relative accuracy, as
weighted relative sensitivity, as weighted relative specificity, as weighted relative
negative reliability, and as novelty. We believe this to be a significant contribution
to the understanding of rule evaluation measures, which could be obtained
because of our unifying contingency table framework.

Further work includes the generalization to rule set evaluation measures.
These differ from rule evaluation measures in that they treat positive and
negative examples symmetrically, e.g.,
$RuleSetAcc(H \leftarrow B) = p(HB) + p(\overline{H}\,\overline{B})$.
Another extension of this work would be to investigate how some of these measures
can be used as search heuristics rather than filtering measures. Finally, we
would like to continue empirical evaluation of $WRAcc(R)$ as a filtering measure
in various domains such as association rule learning and first-order knowledge
discovery.

Abstract. This paper reports the ongoing work of producing a state of
the art part of speech tagger for unedited Swedish text. Rules eliminating
faulty tags have been induced using Progol. In previously reported
experiments, almost no linguistically motivated background knowledge
was used [?]. Still, the result was rather promising (recall 97.7%, with a
pending average ambiguity of 1.13 tags/word). Compared to the previous
study, a much richer, more linguistically motivated, background knowledge
has been supplied, consisting of examples of noun phrases, verb
chains, auxiliary verbs, and sets of part of speech categories. The aim
has been to create the background knowledge rapidly, without laborious
hand-coding of linguistic knowledge. In addition to the new background
knowledge, new, more expressive rule types have been induced for two
part of speech categories and compared to the corresponding rules of the
previous bottom-line experiment. The new rules perform considerably
better, with a recall of 99.4% for the new rules, compared to 97.6% for
the old rules. Precision was slightly better for the new rules.



1 Introduction 

The task of a part of speech (POS) tagger is to assign to each word in a text
the contextually correct morphological analysis. POS tagging of unedited text
is an interesting task for several reasons: it is a well-known problem to
which several different techniques have been applied, it is a hard task, and it
has real-world applications (in text-to-speech conversion, information
extraction/retrieval, corpus linguistics, and so on).

The current paper describes the ongoing work of creating a state of the art 
POS tagger for unrestricted Swedish text, based on ILP techniques. The 
rule-based tagger is inspired by the Constraint Grammar approach to POS 
disambiguation, and the rules are induced with the help of the freely available 
Progol ILP system. The training material is sampled from the 1 million 
word Stockholm-Umeå Corpus. 

The experiments of this article build on previously reported bottom-line 
experiments, in which only very limited background knowledge was supplied. 



S. Dzeroski and P. Flach (Eds.): ILP-99, LNAI 1634, pp. 186-197, 1999. 
© Springer-Verlag Berlin Heidelberg 1999 

In the current work, Progol has been given much richer, more linguistically 
motivated background knowledge. However, an important aspect when creating 
the background knowledge has been not to spend too much effort on hand-coding 
linguistic knowledge. The background knowledge has been produced using rather 
“quick and dirty” methods. In addition to the richer background knowledge, 
new, more expressive rules have been induced. This study has concentrated 
on the two most frequent POS categories of the training corpus, the verb and the 
noun, and the result is compared to the noun and verb rules of the previously 
reported bottom-line experiment. 

The current experiment has not been focused on comparing the result to other 
(machine-learning) approaches. Rather, the purpose has been to investigate to 
what degree supplying richer background knowledge would improve accuracy 
compared to the previous results. The result indicates that there is indeed 
quite a lot to gain: comparing the new rules to those of the earlier experiment 
yields much better recall (99.4% vs. 97.6%) and slightly better precision too. 

The paper is organized as follows: Section 2 provides the necessary 
background on Constraint Grammar and the Stockholm-Umeå Corpus, and summarizes 
some previous work. The main features of the current work are presented in 
Sect. 3. The results are given in Sect. 4 and discussed in Sect. 5, and 
finally some future work is presented in Sect. 6. 



2 Background 

2.1 Constraint Grammar 

Constraint Grammar (CG) is a successful approach to POS tagging (and shallow 
syntactic analysis), developed at Helsinki University. After a lexicon 
look-up, rendering the text morphologically ambiguous, rules discarding 
contextually impossible readings are applied until the text is unambiguous or 
no more rules can fire. The rules are hand-coded by experts, and the CG 
developers report high figures of accuracy for unrestricted English text. 

A nice feature of CG is that the rules can be thought of as independent of 
each other; new rules can be added, or old ones changed, without worrying about 
unexpected behaviour because of complex dependencies between existing rules 
or because of rule ordering. 



2.2 The Stockholm-Umeå Corpus 

The Stockholm-Umeå Corpus (SUC) is a POS tagged, balanced corpus of 
1 000 000 words of Swedish text. The SUC tag set has 146 unique tags, 
and the tags consist of a POS tag followed by a (possibly empty) set of 
morphological features. There are 24 different POS categories. The first 
edition is available on CD-ROM and a second, corrected edition is on its way. 

2.3 Previous Work on the Induction of Constraint Grammars 



Several experiments on inducing CGs have been reported [12,4,5,8], all with 
promising results. While one of them used n-gram statistics, the other authors 
used ILP (and Progol). 

In the earlier study, bottom-line experiments aimed at inducing rules for 
disambiguating the tag set of the Swedish Stockholm-Umeå Corpus were reported. 
The intention was to investigate whether ILP and Progol were useful tools to 
apply to POS disambiguation. Rules discarding contextually incorrect readings 
were induced one POS category at a time, for all of the 24 POS categories of 
the corpus. The background knowledge given to Progol was extremely limited; it 
merely consisted of access predicates used to pull out features from the 
context words. Furthermore, the context of the focus word was limited to a 
window of only a few words. The disambiguation rules could refer to word tokens 
as well as morphological features of the context words and the target word. The 
following rule discards all verb tags (vb) in the imperative (imp), active 
voice (akt) if the word immediately to the left of the target word is att 
(which is the infinitive marker or a subordinating conjunction): 



remove(vb,A) :- 
    constr(A,left_target(word(att),feats([imp,akt]))). 

The rules were tested on unseen test data of 42 925 known words (words in 
the lexicon), including punctuation marks. After lexicon look-up the words were 
assigned 93 810 readings, i.e., on average 2.19 readings per word. 41 926 words 
retained the correct reading after disambiguation, which means that the correct 
tag survived for 97.7% of the words. After tagging, there were 48 691 readings 
left, 1.13 readings per word. As a comparison to these results, a preliminary 
test of two different taggers reports that the Brill tagger, also trained on 
SUC, tagged 96.9% of the words correctly, and Oliver Mason’s HMM-based QTag 
got 96.3% on the same data. Yet another Markov model tagger, trained on the 
same corpus, albeit with a slightly modified tag set, tagged 96.4% of the words 
correctly. None of these taggers left ambiguities pending, and all handled 
unknown words (which makes a direct comparison with the results above somewhat 
difficult). 
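
The figures quoted for the earlier experiment can be re-derived from the raw counts given in the paragraph above. The short check below is only an arithmetic sketch; the variable names are ours, not the authors'.

```python
# Counts quoted in the text for the earlier bottom-line experiment.
words = 42_925            # known test words, including punctuation marks
readings_before = 93_810  # readings assigned by lexicon look-up
correct_kept = 41_926     # words whose correct reading survived
readings_after = 48_691   # readings left after disambiguation

ambiguity_before = readings_before / words
recall = correct_kept / words
ambiguity_after = readings_after / words

print(f"{ambiguity_before:.2f} readings/word before disambiguation")
print(f"{recall:.1%} of the words keep the correct reading")
print(f"{ambiguity_after:.2f} readings/word after disambiguation")
```

Rounded as in the paper, these come out as 2.19 readings per word, 97.7% and 1.13 readings per word.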



3 Current Work 

The current work investigates the possibility of increasing the tagging accuracy 
of the earlier experiments by adding more linguistic background knowledge and 
by extending the expressiveness of the rules in other respects as well. These 
are the five main novel contributions of the current work compared to the 
earlier reported experiments: 

— More extensive background knowledge 

— New training example format, allowing for feature under-specification 

— New rule types 

— The possibility of feature unification 

— Variable window size 




3.1 Training Data 

A lexicon was created from SUC, and each word in the training data was assigned 
an ambiguity class in which all possible morphological readings of the word 
were represented. Each tag was represented as a feature-value vector, where 
each feature had a name, e.g. 'POS'=vb, 'VFORM'=inf, 'VOICE'=akt, etc. The 
features were named to allow the rules to refer to a feature without specifying 
its value. Negative and positive examples for the noun and the verb tags were 
generated for the different rule types to learn. 

A positive training example in the new format for a select rule (see Sect. 3.3) 
can be found below. The format is select(POSCategory, FocusWord, LHSContext, 
RHSContext).^1 The focus word as well as the context words are represented 
by a quadruple, where the first item is the actual word form; the second is the 
frequency of the word form in the training data; the third is a binary feature 
which says whether the word starts with a capital letter (*) or not (-); and the 
fourth is a morphological reading, t/8, in which the first 
argument is the base (look-up) form of the word, the second through seventh 
arguments are morphological feature-value pairs, and the last argument is a 
frequency figure for that particular reading of the word form. 

% ''Affärsområden Plastic Systems 7 procent av faktureringen.'' 
select(nn, 
  [procent,488,-,t(procent,'POS'=nn,'GEN'=utr,'NUM'=plu,'DEF'=ind, 
                   'CASE'=nom,-,469)], 
  [ 
   '7',132,-,t('7','POS'=rg,'CASE'=nom,-,-,-,-,124), 
   systems,6,*,t('Systems','POS'=pm,'CASE'=nom,-,-,-,-,6), 
   plastic,4,*,t('Plastic','POS'=pm,'CASE'=nom,-,-,-,-,4), 
   affärsområden,1,*,t(affärsområde,'POS'=nn,'GEN'=neu,'NUM'=plu, 
                   'DEF'=ind,'CASE'=nom,-,1) 
  ], 
  [ 
   av,13973,-,t(av,'POS'=pp,-,-,-,-,-,13569), 
   faktureringen,12,-,t(fakturering,'POS'=nn,'GEN'=utr,'NUM'=sin, 
                   'DEF'=def,'CASE'=nom,-,12), 
   '.',57122,-,t('.','POS'=dl,'DEL'=mad,-,-,-,-,57122) 
  ]). 
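
The example format can be mirrored in a few lines of Python to make the layout explicit. This is only an illustrative sketch: the `Word` type and the `make_select_example` helper are hypothetical names of ours, the reading is simplified to a dictionary, and the word data are taken from the example above.

```python
from typing import NamedTuple

class Word(NamedTuple):
    form: str          # actual word form
    form_freq: int     # frequency of the word form in the training data
    capitalized: bool  # '*' in the paper's notation, '-' otherwise
    reading: dict      # simplified stand-in for the t/8 term

def make_select_example(pos, words, focus):
    """Build (POSCategory, FocusWord, LHSContext, RHSContext)."""
    lhs = list(reversed(words[:focus]))  # nearest left neighbour comes first
    rhs = list(words[focus + 1:])
    return (pos, words[focus], lhs, rhs)

words = [
    Word("plastic", 4, True, {"base": "Plastic", "POS": "pm", "freq": 4}),
    Word("systems", 6, True, {"base": "Systems", "POS": "pm", "freq": 6}),
    Word("7", 132, False, {"base": "7", "POS": "rg", "freq": 124}),
    Word("procent", 488, False, {"base": "procent", "POS": "nn", "freq": 469}),
    Word("av", 13973, False, {"base": "av", "POS": "pp", "freq": 13569}),
]

example = make_select_example("nn", words, focus=3)
print([w.form for w in example[2]])  # left context, reversed
print([w.form for w in example[3]])  # right context
```

Reversing the left context means that the first list element is always the word immediately to the left of the focus word, whatever the window size.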



The word frequency figures were used by the background predicates to avoid 
inducing rules referring to low-frequency word forms. The rules reported in this 
work were induced using the SICStus P-Progol implementation, version 
2.7.4c. Due to rather long processing times (days in some cases), it was not 
feasible to perform a ten-fold cross validation of the induced theories. 



^1 Following [1], the left hand side context (LHSContext) is reversed in order 
to make it easier for the background knowledge to refer to the words closest to 
the focus word. 

3.2 Background Knowledge 

One of the main goals of this paper has been to provide Progol with potentially 
useful linguistic knowledge, which hopefully would yield a better, more compact 
and interesting theory. However, writing a natural language grammar manually 
is a very complicated and time-consuming matter. The ambition has been to 
provide a happy medium between the two extremes of providing virtually no 
linguistic background knowledge at all, and spending a lot of time and effort on 
coding high-level linguistic knowledge (such as e.g. a phrase structure grammar). 

Thus, the linguistic data added to the background knowledge consists of 

— Examples of noun phrase (NP) tag sequences 

— Auxiliary verbs 

— Auxiliary verb sequences 

— Examples of verb chains 

— Sets of POS categories with similar distributional statistics 

— Sets of ‘barriers’ 



Noun Phrases. A collection of non-recursive base NPs was extracted manually 
from one of the corpus files (Table 1). These 196 noun phrases consist of the 
word tokens of the original text and a tag for each word. These noun phrases 
were reduced to 92 unique tag sequences where all word tokens have been removed. 
Examples of noun phrase tag sequences of the format np_tag_seq(List) are found 
in Table 2, where List is a sequence of morphological readings. 



Table 1. Manually extracted noun phrases 

% ‘nuclear weapons’ 
NP: kärnvapen 'POS'=nn,'GEN'=neu,'NUM'=plu,'DEF'=ind,'CASE'=nom 

% ‘I’ 
NP: jag 'POS'=pn,'GEN'=utr,'NUM'=sin,'DEF'=def,'PFORM'=sub 

% ‘the Lithuanian member of parliament, Nikolaj Medvedjev’ 
NP: den 'POS'=dt,'GEN'=utr,'NUM'=sin,'DEF'=def 
    litauiske 'POS'=jj,'DEG'=pos,'GEN'=mas,'NUM'=sin,'DEF'=def,'CASE'=nom 
    parlamentsledamoten 'POS'=nn,'GEN'=utr,'NUM'=sin,'DEF'=def,'CASE'=nom 
    nikolaj 'POS'=pm,'CASE'=nom 
    medvedjev 'POS'=pm,'CASE'=nom 
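
The reduction from tagged noun phrases to unique tag sequences amounts to dropping the word tokens and removing duplicates. A minimal sketch follows, assuming each noun phrase is a list of (token, features) pairs; the function name and the second pronoun phrase are hypothetical and only serve the illustration.

```python
def unique_tag_sequences(noun_phrases):
    """Strip the word tokens and keep each tag sequence once, in order."""
    seen, unique = set(), []
    for phrase in noun_phrases:
        # A hashable tag sequence: one sorted feature tuple per word.
        tags = tuple(tuple(sorted(feats.items())) for _token, feats in phrase)
        if tags not in seen:
            seen.add(tags)
            unique.append(tags)
    return unique

phrases = [
    [("jag", {"POS": "pn", "NUM": "sin"})],
    [("du", {"POS": "pn", "NUM": "sin"})],  # hypothetical: same tags as 'jag'
    [("nikolaj", {"POS": "pm", "CASE": "nom"}),
     ("medvedjev", {"POS": "pm", "CASE": "nom"})],
]

sequences = unique_tag_sequences(phrases)
print(len(phrases), "noun phrases ->", len(sequences), "unique tag sequences")
```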



Verbs. Different sets of verbs and verb sequences (“verb chains”) have been 
created by chopping off the top of frequency lists of verb sequences extracted 
automatically from the corpus. Auxiliary verb sequences of one or two 
consecutive auxiliaries are given by the aux/1 and aux/2 facts that can be seen 

Table 2. NP tag sequences 

% dt pc nn    determiner, participle, noun 
np_tag_seq([t('POS'=dt,'GEN'=utr/neu,'NUM'=plu,'DEF'=def), 
            t('POS'=pc,'VFORM'=prf,'GEN'=utr/neu,'NUM'=plu,'DEF'=ind/def, 
              'CASE'=gen), 
            t('POS'=nn,'GEN'=utr,'NUM'=sin,'DEF'=ind,'CASE'=nom)]). 

% dt jj nn    determiner, adjective, noun 
np_tag_seq([t('POS'=dt,'GEN'=utr,'NUM'=sin,'DEF'=def), 
            t('POS'=jj,'DEG'=suv,'GEN'=utr/neu,'NUM'=sin/plu,'DEF'=def, 
              'CASE'=nom), 
            t('POS'=nn,'GEN'=utr,'NUM'=sin,'DEF'=def,'CASE'=nom)]). 



in Table 3. Verb chains of one or more auxiliary verbs (possibly with 
intervening adverbs), followed by a full set of morphological features for the 
“content verb” which ends the verb chain, are also supplied. The 126 most 
common verb chains are added to the background knowledge in the following 
format: vb_chain([token_1, ..., token_n, reading]). A few examples are found 
in Table 4. 



Table 3. Auxiliary verb sequences 

% ‘have’     % ‘would have’     % ‘can’/‘be able to’   % ‘would be able to’ 
aux(har).    aux(skulle,ha).    aux(kan).              aux(skulle,kunna). 



Table 4. Verb chains 

% skulle kunna... (‘would be able to...’) 
vb_chain([token(skulle),token(kunna),t('POS'=vb,'VFORM'=inf, 
          'VOICE'=akt)]). 

% att kunna... (‘to be able to...’) 
vb_chain([token(att),token(kunna),t('POS'=vb,'VFORM'=inf,'VOICE'=akt)]). 



Sets of POS Categories with Similar Distributional Statistics. Twenty-four 
different sets of two or three POS categories have been extracted from 
the training data semi-automatically. These sets of POS categories have been 
produced from tables of trigrams of the tags of the training corpus. Sets of POS 
categories that seemed to have a similar distribution were created by inspecting 
bigram frequency lists of tag pairs of words separated by one intervening word. 
For each such tag pair, a frequency list of the tags occurring between the bigram 
tags was used to find tags which might have a similar distribution. In addition 
to raw frequency, an association ratio score (or mutual information) 
was computed for each bigram and the tags which occurred in between, and the 
most frequent tags with the highest association ratio were lumped together to 
form a POS set. Association ratio indicates the extent to which items co-occur, 
and was computed using 



$$I(x,y) = \log \frac{P(x,y)}{P(x)\,P(y)}$$ 



where P(x, y) is the probability of the POS bigram x occurring with the category 
y in between; P(x) is the probability of the bigram x and P(y) the probability 
of the category y. Examples of the resulting sets are found in Table 5. 
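
As a rough illustration of how such scores could be obtained from tag trigrams, the sketch below estimates the three probabilities by relative frequency. It is not the authors' procedure: the function name, the toy trigrams and the choice of base-2 logarithm are assumptions made for the example (the text does not state the base).

```python
import math
from collections import Counter

def association_ratios(tag_trigrams):
    """Map ((left, right), middle) to log2 of P(x,y) / (P(x) P(y))."""
    trigrams = list(tag_trigrams)
    n = len(trigrams)
    joint = Counter(((l, r), m) for l, m, r in trigrams)  # bigram x with y inside
    pairs = Counter((l, r) for l, _m, r in trigrams)       # bigram x
    middles = Counter(m for _l, m, _r in trigrams)         # category y
    return {
        (x, y): math.log2((c / n) / ((pairs[x] / n) * (middles[y] / n)))
        for (x, y), c in joint.items()
    }

# Toy data: (left tag, middle tag, right tag).
trigrams = [("dt", "jj", "nn"), ("dt", "jj", "nn"),
            ("dt", "ab", "nn"), ("vb", "ab", "vb")]
ratios = association_ratios(trigrams)
for (x, y), score in sorted(ratios.items()):
    print(x, y, round(score, 3))
```

A positive score means that the middle tag occurs between the two outer tags more often than independence would predict; a negative score means less often.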



Table 5. Examples of sets of POS categories 

pos_set([sn,hp,pc]).    pos_set([rg,nn,ro]).    pos_set([pm,pn]). 
pos_set([ab,jj]).       pos_set([ab,ha,ps]).    pos_set([ab,ha]). 
pos_set([dt,ab,ha]).    pos_set([ab,pl]).       pos_set([ab,jj,ps]). 
pos_set([ps,ha]).       pos_set([pl,ha,jj]). 



Barriers. The sets of POS categories used in the barrier rules have been 
collected in an even simpler manner: with the help of frequency lists of POS 
bigrams, the categories which occur most frequently before the category that 
the barrier rule was intended for have been put into a set. For instance, the 
categories most frequently occurring immediately before the verb are 
{nn,pn,ab,hp,ie} (noun, pronoun, adverb, WH-pronoun, infinitive marker), and can 
be used by barrier rules for discarding faulty verbal readings. Barrier rules 
are explained in Sect. 3.3. 
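
Collecting such a set from POS bigram counts can be sketched as follows. The function name, the cut-off parameter `top_n` and the toy tag sequences are assumptions for illustration; the paper does not say how many categories were kept per set.

```python
from collections import Counter

def preceding_categories(tag_sequences, target, top_n):
    """Return the top_n tags most often found immediately before `target`."""
    counts = Counter()
    for tags in tag_sequences:
        for prev, cur in zip(tags, tags[1:]):
            if cur == target:
                counts[prev] += 1
    return {tag for tag, _count in counts.most_common(top_n)}

# Toy tagged sentences, one tag per word.
corpus = [
    ["pn", "vb", "nn"],
    ["nn", "vb"],
    ["dt", "nn", "vb", "ab"],
    ["ie", "vb"],
]
before_verb = preceding_categories(corpus, target="vb", top_n=2)
print(before_verb)
```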

3.3 The Rule Types: Remove, Barrier and Select 

Remove rules, barrier rules and select rules have been induced. These rule types 
are all presented in [12]. Sets of “lexical” rules have also been induced. Each 
of the rule types will be described in the next four sections. 

The window size of the select and remove rules is not fixed, but depends on 
the size of a potential background knowledge noun phrase or verb sequence (see 
Sect. 3.2). 

Different numbers of training examples have been used for the different rule 
types, depending on the complexity of the background knowledge. For the barrier 
rules, 2 000 positive and 6 000 negative examples have been used. The other rule 
types had up to 10 000 positive and 20 000 negative examples. 

The format of the rules presented below is not the actual Progol output 
format, but a common intermediate format, which the Progol rules have been 
translated into. 

Remove Rules. Remove rules are used to remove unwanted readings of a word. 
The remove rule can look at specific features of words and also take advantage 
of the linguistic background knowledge (described in Sect. 3.2). 

The rule below is an example of a remove rule that removes every noun 
reading (nn) of the word för^2 (‘for’, ‘to’, ‘prow’, ‘of’, ...) if the first 
word to the right is an auxiliary verb (aux). 

nn remove 

target: token=för, 
right 1: sequence=aux. 
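
To make the effect of this rule concrete, here is a small sketch of how it might be applied to an ambiguous sentence. The data layout, the `AUX` set (a stand-in for the one-word aux/1 facts of Table 3) and the guard that keeps at least one reading are assumptions of this illustration, not part of the induced rule.

```python
# Hypothetical stand-in for the one-word auxiliary facts aux(har), aux(kan).
AUX = {"har", "kan"}

def apply_remove_rule(sentence, i):
    """nn remove / target: token=för / right 1: sequence=aux."""
    target = sentence[i]
    right = sentence[i + 1] if i + 1 < len(sentence) else None
    if target["token"] == "för" and right is not None and right["token"] in AUX:
        kept = [r for r in target["readings"] if r["pos"] != "nn"]
        if kept:  # assumption: never delete the last remaining reading
            target["readings"] = kept
    return sentence

# Toy input: only three of the readings of 'för' are listed.
sentence = [
    {"token": "för", "readings": [{"pos": "nn"}, {"pos": "pp"}, {"pos": "ab"}]},
    {"token": "kan", "readings": [{"pos": "vb"}]},
]
apply_remove_rule(sentence, 0)
print([r["pos"] for r in sentence[0]["readings"]])
```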



Barrier Rules. Barrier rules allow for an arbitrarily long context, given that 
certain features are not present in the context. The features not allowed are called 
barriers (which in this case consist of POS categories) and block out exceptions 
to the rule which would otherwise have made it invalid. 

The barrier rules of the current work consist of at most three parts: a con- 
straint on a word on an unspecified position to the left of the target, a barrier, 
and a constraint on the target word. 

Below is an example of an induced rule which removes the noun (nn) reading of 
the word man (‘man’, ‘one’ (pronoun), ‘mane’) if there is a verb (vb) somewhere 
(*) to the left, and no determiner (dt) or adjective (jj) occurs between the 
verb and the target word. 



nn barrier 

left *: pos=vb, 
barrier: {dt, jj}, 

target: token=man. 
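
The leftward scan that a barrier rule performs can be sketched as below. For simplicity the sketch assumes an already disambiguated left context given as one tag per word, nearest word first; the function name and parameters are hypothetical.

```python
def barrier_rule_fires(tags_left_of_target, wanted="vb", barriers=("dt", "jj")):
    """Scan leftwards from the target; tags are listed nearest word first."""
    for tag in tags_left_of_target:
        if tag in barriers:
            return False  # a barrier category blocks the rule
        if tag == wanted:
            return True   # the wanted tag was reached with no barrier between
    return False          # no verb anywhere to the left

# 'man' preceded by an adverb and, further left, a verb: the rule fires.
print(barrier_rule_fires(["ab", "vb"]))
# An adjective and a determiner stand between the verb and 'man': blocked.
print(barrier_rule_fires(["jj", "dt", "vb"]))
```

The scan length is unbounded, which is what lets a barrier rule use an arbitrarily long context while the barrier set screens out the exceptions.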



Select Rules. In a sense, the select rule contradicts the philosophy of Constraint 
Grammar, in that it says which reading is the correct one and should be kept, 
discarding every other reading. In some contexts, it might be easier to identify 
the correct reading rather than the faulty ones, and it is in these cases the select 
rule is useful. 

The following rule says that if the word to the left is alla (‘all’, ‘everyone’), 
the plural noun reading(s) should be selected, and every other reading discarded. 

nn select 

left 1: token=alla, 
target: num=plu. 
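
A select rule keeps the matching readings and discards the rest. The sketch below applies the rule above to a word whose singular and plural forms coincide; the function name, the data layout and the example word barn (‘child’/‘children’) are our own illustration.

```python
def apply_select_rule(left_token, readings):
    """nn select / left 1: token=alla / target: num=plu."""
    if left_token == "alla":
        chosen = [r for r in readings
                  if r.get("pos") == "nn" and r.get("num") == "plu"]
        if chosen:  # only select when a plural noun reading exists
            return chosen
    return readings

barn = [{"pos": "nn", "num": "sin"}, {"pos": "nn", "num": "plu"}]
print(apply_select_rule("alla", barn))  # only the plural reading is kept
print(apply_select_rule("ett", barn))   # rule does not apply
```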

^2 The word för appears more than 11 000 times in the training data and has an 
ambiguity class of size eight. 

Lexical Rules. In the current work, the term “lexical rule” is used somewhat 
differently from its use in earlier work, where a lexical rule discards a given 
reading for a specific word form without any contextual constraints at all. 

The lexical rules of the current work are allowed to look at context word 
forms and they have a fixed window of maximally two words to the left and 
two words to the right, plus the focus word. An advantage of using lexical rules 
is that they can be applied even if the context is ambiguous. They take care 
of frequent, easy cases of ambiguities, making it possible for other rules, which 
demand an unambiguous context, to trigger. 

Below are two lexical rules, the first one discarding the noun (nn) reading of 
the word var (‘was’, ‘pus’, ‘(pillow) case’) if it is preceded by the word det (‘it’). 

The second rule says that a specific noun reading should be selected, if the 
target word is preceded by the two-word sequence ett bra (‘a good’). 

nn remove 

left 1: token=det, 
target: token=var. 

nn select 

left 1: {token=ett | token=bra}, 

target: {pos=nn, gen=neu, num=sin, def=ind, case=nom}. 



3.4 Under-Specification and Feature Unification 

The format of the examples has been changed to allow for more powerful rules, 
e.g. by making it possible to induce rules which have under-specified feature 
values. 

A benefit of using a rich language such as Horn clauses, compared to for 
example n-gram statistics, is that it is easier to make use of feature unification. 



Under-specified Feature Values. In the previous work, only feature 
values were extracted. In other words, the rules could not refer to a feature 
without having to specify its value. This is an unnecessary restriction which 
could make the resulting theory larger than it has to be. Separating the feature 
name from its value allows for constraints that, for example, cover both nouns 
and adjectives in the singular and the plural. 

The following select rule is an example of a rule with an under-specified 
feature value (select rules are described in Sect. 3.3). It says that the plural 
noun reading should be kept and every other reading should be removed if the 
target word is folk (‘people’, ‘folk’; singular or plural), and the word to the 
left has the feature VFORM, whatever its value (thus covering both the 
verb and the participle, which have this feature in common). 

nn select 

left 1: vform=_, 

target: {token=folk, num=plu} . 
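
The idea of an under-specified value can be sketched with a constraint matcher in which `None` plays the role of the anonymous variable `_`. The function name and the dictionary encoding of readings are assumptions for illustration only.

```python
def matches(constraint, reading):
    """True if the reading has every named feature; None means 'any value'."""
    return all(
        name in reading and (value is None or reading[name] == value)
        for name, value in constraint.items()
    )

verb = {"pos": "vb", "vform": "prs"}
participle = {"pos": "pc", "vform": "prf"}
noun = {"pos": "nn", "num": "plu"}

has_vform = {"vform": None}  # corresponds to vform=_ in the rule above
print([matches(has_vform, r) for r in (verb, participle, noun)])
```

One constraint covers both the verb and the participle, where a fully specified format would need one rule per VFORM value.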

Feature Unification. Given the proper Progol mode declarations, it is possible 
for a feature to take a variable as its value. In fact, two features can share the 
same variable, forcing the values of the two features to unify. This makes it 
possible to induce more compact rules. Feature unification has, to our knowledge, 
not been used in the original Constraint Grammar. 

In Swedish, adjectives show number (and gender) agreement with the noun 
on which they are dependent. An example of a rule that makes use of feature 
unification can be seen below. The rule forces the number (num) of the target 
word stora (‘big’, adjective, plural or singular) and the first word to the right to 
unify, and selects the agreeing adjective (jj) analysis of the target word. 

jj select 

target: {token=stora, num=V}, 

right: num=V. 

The above rule would for instance select the singular reading of stora in the 
phrase den stora fågeln (‘the big bird’), but select the plural reading of stora in 
de stora fåglarna (‘the big birds’). 
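
The agreement check expressed by the shared variable V can be imitated as follows. This is a sketch under simplifying assumptions: readings are plain dictionaries, the function name is hypothetical, and the target is left untouched when no adjective reading agrees.

```python
def select_agreeing_adjective(target_readings, right_readings):
    """jj select / target: num=V / right: num=V (shared value V)."""
    right_nums = {r["num"] for r in right_readings if "num" in r}
    chosen = [r for r in target_readings
              if r.get("pos") == "jj" and r.get("num") in right_nums]
    return chosen or target_readings

stora = [{"pos": "jj", "num": "sin"}, {"pos": "jj", "num": "plu"}]
fageln = [{"pos": "nn", "num": "sin"}]     # fågeln, 'the bird'
faglarna = [{"pos": "nn", "num": "plu"}]   # fåglarna, 'the birds'

print(select_agreeing_adjective(stora, fageln))
print(select_agreeing_adjective(stora, faglarna))
```

A single rule with a shared variable thus replaces one rule per number value.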

4 Results 

The rule types described above were induced for the two most frequent POS 
categories of the training data, the noun (nn) and the verb (vb). The corpus was 
split into a 90% training set and a 10% test set, in which the different files were 
evenly distributed over the different text genres. The training set was further 
split into a training set proper and an evaluation set. The evaluation set was 
used to identify and remove bad rules. The surviving rules were tested on 8 780 
ambiguous words from the unseen test data. The test words all had at least one 
verb and/or noun reading, in all 35 678 readings (≈ 4 readings/word). The exact 
same data was also disambiguated with the help of the old noun and verb rules. 
The old rules made 210 mistakes (discarding the correct reading), and there 
were 28 601 readings left after disambiguation (obviously leaving readings which 
belong to POS categories other than the noun or verb). The new rules made 57 
mistakes and left 28 546 tags pending. This means that the new rules show a 
significant improvement in recall, 99.4% compared to 97.6%, while keeping the 
ambiguity at about the same level as in the bottom-line experiment. This appears 
to be a promising result. If it holds for the rest of the POS categories, there will 
be room for applying less reliable rules to reduce the pending ambiguities to a 
more acceptable level than the 1.13 tags/word reported for the earlier rules. 



Table 6. Result 

                     Old rules    New rules 
Readings pending     28 601       28 546 
Recall               97.6%        99.4% 
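
The recall figures in Table 6 follow directly from the number of test words and the number of mistakes reported in this section. The check below is a plain arithmetic sketch with variable names of our own choosing.

```python
test_words = 8_780        # ambiguous test words
total_readings = 35_678   # readings before disambiguation
old_mistakes, new_mistakes = 210, 57
old_pending, new_pending = 28_601, 28_546

recall_old = (test_words - old_mistakes) / test_words
recall_new = (test_words - new_mistakes) / test_words

print(f"readings/word before: {total_readings / test_words:.2f}")
print(f"old rules: recall {recall_old:.1%}, "
      f"{old_pending / test_words:.2f} readings/word pending")
print(f"new rules: recall {recall_new:.1%}, "
      f"{new_pending / test_words:.2f} readings/word pending")
```

The pending ambiguity per word is nearly identical for the two rule sets, so the gain in recall is not bought with extra ambiguity.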

5 Discussion 



As mentioned in Sect. 2.3, a state of the art tagger trained on the 
Stockholm-Umeå Corpus typically has a recall of 96.3-96.9%, with no remaining 
ambiguities. 

There are no results reported for these taggers for the individual POS 
categories. The rules induced in the work reported in this paper should 
primarily be compared with the rules induced in [8], rather than to other 
taggers. However, a small experiment of running QTag on nouns and verbs yields 
a recall of 95.2%, with no remaining ambiguities. How this result should be 
compared in a fair way to the result reported in this paper is unfortunately 
not obvious. 

Some of the rules induced in the current work refer to sets of POS categories, 
which makes these less dependent on a disambiguated context. This is also true 
for the rules which make use of under-specified feature values. We believe that 
when rules for all POS categories have been induced this will result in not only 
better recall, but also better overall precision, compared to the earlier rules. 

In an early stage of the work described here, a class of verbs of “saying”, 
denoting things like “say”, “propose”, etc, was added to the background know- 
ledge. However, it turned out that this (semantic) category was quite useless for 
the task at hand, and was never used in any rules. 



6 Future Work 



When the noun phrases used in the current investigation were collected, they 
were initially divided into six different categories, which were lumped together 
in the background knowledge finally used in the rule induction. Perhaps it would 
be worthwhile to split the NP example base into these categories, and see if 
there are some NP categories that are more useful than others. 

An important task of a useful tagger is that of handling unknown words 
(words not in the lexicon); however big a lexicon might be, new words will 
always turn up. Just consider proper nouns or foreign (loan) words. In Swedish, 
which has a much richer morphology than English, new words can for instance 
be created by concatenating two or more existing words, possibly by infixing a 
“glue s”. For example, rusdrycksförsäljningsförordning (‘law governing the sale 
of liquor and wine’) consists of the concatenated words 
rus-dryck-s-försäljning-s-förordning. This way of extending the lexicon is a 
productive process, not at all unusual. 

Replacing the previously induced rules (with minimal background knowledge) 
with the rules reported in this paper would increase recall without increasing 
the ambiguity. If the results of the current investigation hold for the rest of 
the POS categories, the performance of the tagger will increase 
considerably. We consider the results of this study to be good news, and will 
continue by producing rules for all the 24 POS categories of the corpus. 

Abstract. Shinohara, Arimura, and Krishna Rao have shown learnability 
in the limit of minimal models of classes of logic programs from 
positive only data. In most cases, these results involve logic programs 
in which the “size” of the head yields a bound on the size of the body 
literals. However, when local variables are present, such a bound on body 
literal size cannot directly be ensured. The above authors achieve such 
a restriction using technical notions like mode and linear inequalities. 
The present paper develops a conceptually clean framework where the 
behavior of local variables is controlled by nonlocal ones. It is shown that 
for certain classes of logic programs, learnability from positive data is 
equivalent to limiting identification of bounds for the number of clauses 
and the number of local variables. This reduces the learning problem to 
finding two integers. This cleaner framework generalizes all the known 
results and establishes learnability of new classes. 



1 Introduction 

Recently, there has been considerable interest in deriving theoretical learnability 
results for Inductive Logic Programming. A lot of this work has been in the 
framework of Probably Approximately Correct (PAC) learning. The reader is 
referred to Dzeroski, Muggleton, and Russell; Cohen; De Raedt and 
Dzeroski; Frisch and Page; Yamamoto; Kietz; and Maass and 
Turan for a sample. 

Unfortunately, most of the positive results in the PAC setting are for very 
restricted classes of logic programs. PAC is a very strict learning criterion, more 
suited to learnability analyses of less expressive concepts representable in 
propositional logic. This necessitates exploring less strict learnability 
criteria like identification in the limit for learnability analyses of logic 
programs.^1 

* Supported by the Australian Research Council Grant A49803051. 

^1 The learnability analyses for ILP in the learning by query model overcome some 
of the restrictive nature of the PAC model by allowing the learner queries to an 
oracle. However, these oracles may sometimes be noncomputable. For examples of 
such analyses, see Khardon and Krishna-Rao and Sattar. Also, it may be 
argued that concrete learning models proposed for ILP (like Muggleton and Page’s 
U-learnability) have an identification in the limit flavor. 



S. Dzeroski and P. Flach (Eds.): ILP-99, LNAI 1634, pp. 198- 120^ 1999. 
© Springer-Verlag Berlin Heidelberg 1999 

The first identification in the limit result about learnability of logic programs 
is due to Shapiro. He showed that the class of h-easy models is identifiable 
in the limit from both positive and negative facts. In this paper we are mainly 
concerned with learnability from only positive facts. Shinohara showed that 
the class of minimal models of linear Prolog programs consisting of at most m 
clauses is identifiable in the limit from only positive facts. Unfortunately, linear 
logic programs are very restricted as they do not even allow local variables (i.e., 
each variable in the body must appear in the head). Arimura and Shinohara 
introduced the class of linearly-covering logic programs, which allows local 
variables in a restricted sense. They showed that the class of minimal models of 
linearly-covering Prolog programs consisting of at most m clauses of bounded 
body length is identifiable in the limit from only positive facts. Krishna Rao 
noted that the class of linearly-covering programs is very restrictive as it does 
not even include the standard programs for reverse, merge, split, partition, 
quick-sort, and merge-sort. He proposed the class of linearly-moded programs 
that includes all these standard programs and showed the class of minimal 
models of such programs consisting of at most m clauses of bounded body length 
to be identifiable in the limit from positive facts. 
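
The notion of a local variable used above (a variable that occurs in the body of a clause but not in its head) is easy to check mechanically. The sketch below represents a clause only by the variable sets of its head and body literals; this encoding and the function name are assumptions made for illustration.

```python
def local_variables(head_vars, body_literals):
    """Variables that occur in some body literal but not in the head."""
    body_vars = set().union(*body_literals)
    return body_vars - set(head_vars)

# append([H|T], L, [H|R]) :- append(T, L, R).
append_clause = ({"H", "T", "L", "R"}, [{"T", "L", "R"}])

# reverse([H|T], R) :- reverse(T, RT), append(RT, [H], R).
reverse_clause = ({"H", "T", "R"}, [{"T", "RT"}, {"RT", "H", "R"}])

print(local_variables(*append_clause))   # no local variables
print(local_variables(*reverse_clause))  # RT is local
```

The usual recursive clause for reverse already contains a local variable, which is why classes that forbid local variables exclude such standard programs.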

In this paper, we present a general framework for learnability of minimal 
models of definite logic programs from positive facts that covers all the above- 
mentioned cases and more. In the rest of this section, we give an informal dis- 
cussion of our results. 

A frame is a pair consisting of a class $\mathcal{C}$ of Herbrand structures and 
a recursive enumeration $(T_i)_{i \in \mathbb{N}}$ of definite logic programs 
such that: 

1. the minimal Herbrand model of each program in 
   $\{T_i \mid i \in \mathbb{N}\}$ belongs to $\mathcal{C}$, and each structure 
   in $\mathcal{C}$ is the minimal Herbrand model of a program in 
   $\{T_i \mid i \in \mathbb{N}\}$; 

2. the class of structures represented by the sequence of programs is an indexed 
   family of recursive structures;^2 and 

3. all nonempty subsets of a program (viewed as a set of definite clauses) in 
   the sequence $(T_i)_{i \in \mathbb{N}}$ are also in the sequence. 

Intuitively, $\mathcal{C}$ is the concept class of the frame and the sequence 
$(T_i)_{i \in \mathbb{N}}$ is its hypothesis space. 

The positive data about a structure is modeled as a text. A text for a structure 
is any enumeration of all atomic sentences in the language true in the structure. 

In the sequel, let $\mathcal{F} = (\mathcal{C}, \mathcal{H})$ be a frame. 

We say that $\mathcal{C}$ is learnable in $\mathcal{F}$ just in case for any 
structure $S$ in $\mathcal{C}$, a program for $S$ in $\mathcal{H}$ can be 
determined in the limit from any text for $S$. 

In the present paper we tie the learnability of concepts to the learnability of 
bounds for the number of clauses and bounds for the number of local variables. 
We informally introduce these latter notions next. 

We say that bounds for the number of clauses are learnable in $\mathcal{F}$ if 
for any structure $S$ in $\mathcal{C}$, a bound for the number of clauses of a 
program for $S$ in $\mathcal{H}$ can 

be determined in the limit from any text for $S$. Similarly, we say that bounds 
for the number of local variables are learnable in $\mathcal{F}$ if for any 
structure $S$ in $\mathcal{C}$, a bound for the number of local variables for 
all clauses of a program for $S$ in $\mathcal{H}$ can be determined in the limit 
from any text for $S$. 

^2 This means that some effective procedure decides whether $\alpha$ is a 
logical consequence of $T_i$, for all atomic sentences $\alpha$ and 
$i \in \mathbb{N}$. 

We show that $\mathcal{C}$ is learnable in $\mathcal{F}$ just in case: 

1. bounds for the number of clauses in $\mathcal{F}$ are learnable; 

2. bounds for the number of local variables in $\mathcal{F}$ are learnable; and 

3. for natural numbers $n_1$ and $n_2$ and for each atomic sentence $\alpha$, 
   there are only finitely many programs with no more than $n_1$ clauses, each 
   of which has no more than $n_2$ local variables, that entail $\alpha$. 

The above may be viewed as an extension of Shinohara's framework, and
condition 3 is very similar to Shinohara's notion of finite bounded thickness.
However, to simplify the application of the above sufficient condition, we consider
special cases of hypothesis spaces.

In Sect. 4 we introduce the semantic notion of admissible programs such that
if the hypothesis space $\mathcal{H}$ of frame $\mathcal{F}$ consists only of admissible programs, then
the concept class $\mathcal{C}$ is learnable in $\mathcal{F}$ if and only if bounds for the number of clauses
and bounds for the number of local variables are learnable in $\mathcal{F}$.

The notion of admissible programs however is semantic and it cannot be 
determined from syntax whether a definite program is admissible. Also, all ad- 
missible programs in a language cannot be cast in a frame as the class of all 
admissible programs is not recursively enumerable. 

Hence, the next two sections address frames whose hypothesis spaces consist
of programs whose admissibility can be determined syntactically. Section 5
introduces the notion of strongly admissible programs, which generalize the class of
linear programs due to Shinohara and reductive programs due to Krishna
Rao. Strongly admissible programs, however, do not allow local variables.
Section 6 considers the interesting case of safe programs, which allow local
variables. The main idea in their definition is to "bound" all atomic formulas in which
local variables occur by atomic formulas without local variables in such a way
that the resulting program is strongly admissible. This idea may be viewed as
a generalization of the very technical approaches taken by Arimura and
Shinohara in linearly-covering programs and by Krishna Rao in linearly-moded
programs.

In Sect. 7 we give examples of safe programs, including exponentiation, an
example that cannot be shown to be learnable according to Krishna Rao's
linearly-moded programs. We now proceed formally.

2 General Notation and Terminology 

Let $\mathcal{L}$ be a finite first-order language without equality, whose set of variables is
$\{v_i \mid i \in N\}$. Let $\kappa \geq 1$ and $\{\iota_k \mid k < \kappa\}$ be such that $\{p_k \mid k < \kappa\}$ enumerates the
predicates of $\mathcal{L}$ and for all $k < \kappa$, $\iota_k$ is the arity of $p_k$. $\mathcal{L}_{term}$ denotes the set of
$\mathcal{L}$-terms, $\mathcal{L}_{cterm}$ the set of closed $\mathcal{L}$-terms (ground terms), $\mathcal{L}_{at}$ the set of atomic
$\mathcal{L}$-formulas, $\mathcal{L}_{atsen}$ the set of atomic $\mathcal{L}$-sentences (i.e. closed atomic $\mathcal{L}$-formulas,
or ground facts), $\mathcal{L}^*_{atsen}$ the set of finite sequences of atomic $\mathcal{L}$-sentences, and $\mathcal{L}_{at}[k]$,
$k < \kappa$, the set of atomic $\mathcal{L}$-formulas of form $p_k(t_1 \ldots t_{\iota_k})$, $t_1 \ldots t_{\iota_k} \in \mathcal{L}_{term}$. Given
$k < \kappa$, $t_1 \ldots t_{\iota_k} \in \mathcal{L}_{term}$, and $1 \leq i \leq \iota_k$, if $\varphi$ denotes the formula $p_k(t_1 \ldots t_{\iota_k})$
then $\varphi(i)$ denotes the term $t_i$. Given clause $C$, $Var_{loc}(C)$ denotes the set of local
variables of $C$ (i.e. variables that appear in the body of $C$, but not in its head).
Without loss of generality, we assume that for every clause $C$ and $i > 0$, if $v_i$
occurs in $C$ then all of $v_0 \ldots v_{i-1}$ occur in $C$ before each occurrence of $v_i$ in $C$.
Given $u$ which is either a term or an atomic formula, and given partial function
$h : Var \to \mathcal{L}_{term}$, $u[h]$ denotes the result of substituting in $u$ every occurrence
of every $x$ in the domain of $h$ by $h(x)$. We always write "program" instead
of "definite program." For all programs $T$, $\mathcal{H}_T$ denotes the minimal Herbrand
model of $T$. If $e$ denotes a sequence whose index set is $N$ then for all $k \in N$, $e[k]$
denotes the initial segment of length $(k + 1)$ of $e$ (e.g., if $e = (e_0, e_1, e_2 \ldots)$ then
$e[3] = (e_0, e_1, e_2)$ and $e[0] = ()$). The cardinality of a set $X$ is denoted $card(X)$.
Given set $X$ of sets and property $P$ on $X$, a minimal member of $X$ that satisfies
$P$ is any $E \in X$ such that $E$ satisfies $P$ and no $F \in X$ which is strictly included
in $E$ satisfies $P$. By cofinitely many, we mean all but a finite number of.

² More precisely, the programs have to be minimal with respect to inclusion, which
means that none of their strict subsets implies $\alpha$.

3 Frames 

Definition 1. Given class $\mathcal{C}$ of Herbrand structures and recursive enumeration
$(T_i)_{i \in N}$ of programs, we say that the pair $(\mathcal{C}, (T_i)_{i \in N})$ is a frame just in case:

1. $\mathcal{C} = \{\mathcal{H}_{T_i} \mid i \in N\}$;

2. $(\mathcal{H}_{T_i})_{i \in N}$ is an indexed family of recursive Herbrand structures;

3. every nonempty subset of a program in $\{T_i \mid i \in N\}$ belongs to $\{T_i \mid i \in N\}$.

Definition 2. A text for a Herbrand structure $S$ is an enumeration of all atomic
sentences $\alpha$ such that $S \models \alpha$.

Definition 3. Let frame $\mathcal{F} = (\mathcal{C}, (T_i)_{i \in N})$ be given.

We say that bounds for the number of clauses are learnable in $\mathcal{F}$ just in case
there is partial recursive function $\Xi : \mathcal{L}^*_{atsen} \to N$ with the following property.
For all $S \in \mathcal{C}$ and texts $e$ for $S$, there is $n \in N$ such that:

i) $\Xi(e[k]) = n$ for cofinitely many $k \in N$;

ii) there is $i \in N$ such that $\mathcal{H}_{T_i} = S$ and $card(T_i) \leq n$.

We say that bounds for the number of local variables are learnable in $\mathcal{F}$ just
in case there is partial recursive function $\Lambda : \mathcal{L}^*_{atsen} \to N$ with the following
property. For all $S \in \mathcal{C}$ and texts $e$ for $S$, there is $n \in N$ such that:

iii) $\Lambda(e[k]) = n$ for cofinitely many $k \in N$;

iv) there is $i \in N$ such that $\mathcal{H}_{T_i} = S$ and $card(Var_{loc}(C)) \leq n$ for all $C \in T_i$.

We say that $\mathcal{C}$ is learnable in $\mathcal{F}$ just in case there is partial recursive function
$\Psi : \mathcal{L}^*_{atsen} \to \{T_i \mid i \in N\}$ with the following property. For all $S \in \mathcal{C}$ and texts $e$
for $S$, there is $i \in N$ such that:

v) $\Psi(e[k]) = T_i$ for cofinitely many $k \in N$;

vi) $\mathcal{H}_{T_i} = S$.
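As an informal illustration of convergence "for cofinitely many k", the following sketch simulates a learner on a toy concept class. Everything in it is an assumption made for illustration: the structures $S_n = \{0, \ldots, n\}$, the function names `text_for`, `learner` and `guesses`, and the choice of hypothesis (the index of the least $S_n$ consistent with the data). It is not the construction used in the proofs below.

```python
from itertools import islice

def text_for(n):
    """Hypothetical text for the toy structure S_n = {0, ..., n}.

    A text enumerates every element of the structure; repetitions are allowed.
    """
    k = 0
    while True:
        yield k % (n + 1)
        k += 1

def learner(segment):
    """Guess the index of the least S_i containing all data seen so far."""
    return max(segment)

def guesses(text, steps):
    """Run the learner on growing initial segments of the text."""
    seen, out = [], []
    for datum in islice(text, steps):
        seen.append(datum)
        out.append(learner(seen))
    return out

# The guesses change finitely often and then stay on the correct index 3.
print(guesses(text_for(3), 8))
```

On any text for $S_3$ the guess eventually stabilizes on 3, which is the behaviour required by conditions v) and vi) in this toy setting.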

Proposition 1. Let frame $\mathcal{F} = (\mathcal{C}, (T_i)_{i \in N})$ be such that the following holds:

1. bounds for the number of clauses are learnable in $\mathcal{F}$;

2. bounds for the number of local variables are learnable in $\mathcal{F}$;

3. for every $n_1, n_2 \in N$ and $\alpha \in \mathcal{L}_{atsen}$, the set of all minimal $T \in \{T_i \mid i \in N\}$
such that $T \models \alpha$, $card(T) \leq n_1$, and $card(Var_{loc}(C)) \leq n_2$ for all $C \in T$, is
finite.

Then $\mathcal{C}$ is learnable in $\mathcal{F}$.

Proof. We first define a partial recursive function $\Phi$ whose domain is $\mathcal{L}^*_{atsen}$ and
whose codomain is the set of all finite subsets of $\{T_i \mid i \in N\}$. Since bounds for
the number of clauses (respect. local variables) are learnable in $\mathcal{F}$, choose partial
recursive $\Xi : \mathcal{L}^*_{atsen} \to N$ (respect. $\Lambda : \mathcal{L}^*_{atsen} \to N$) that satisfies conditions i)
and ii) (respect. iii) and iv)) of Definition 3. Let $k \in N$ and $\alpha_0 \ldots \alpha_k \in \mathcal{L}_{atsen}$
be given. We define $\Phi(\alpha_0 \ldots \alpha_k)$. Towards this aim, we construct by induction
a sequence $(F_0 \ldots F_k)$ of finite subsets of $\{T_i \mid i \in N\}$. Let $F_0$ be the set of all
$T_j$, $j \leq k$, such that $T_j$ is a minimal member of $\{T_i \mid i \in N\}$ with $T_j \models \alpha_0$,
$card(T_j) \leq \Xi(\alpha_0 \ldots \alpha_k)$, and $card(Var_{loc}(C)) \leq \Lambda(\alpha_0 \ldots \alpha_k)$ for all $C \in T_j$.
Let $p < k$ be given, and suppose that $F_p$ has been defined. We define $F_{p+1}$.
Given $X \in F_p$, let $G_X$ be the set of all $T_j$, $j \leq k$, such that $T_j$ is a minimal
member of $\{T_i \mid i \in N\}$ with $T_j \models \alpha_{p+1}$, $card(X \cup T_j) \leq \Xi(\alpha_0 \ldots \alpha_k)$, and
$card(Var_{loc}(C)) \leq \Lambda(\alpha_0 \ldots \alpha_k)$ for all $C \in T_j$. Set:

$F_{p+1} = F_p \cup \{X \cup Y \mid X \in F_p,\ X \not\models \alpha_{p+1},\ Y \in G_X\}$.
Then set $\Phi(\alpha_0 \ldots \alpha_k) = F_k$. This completes the definition of $\Phi$. Now we define
from $\Phi$ a partial recursive function $\Psi : \mathcal{L}^*_{atsen} \to \{T_i \mid i \in N\}$. Fix an enumeration
$(\beta_i)_{i \in N}$ of $\mathcal{L}_{atsen}$. Let computable function $O$, whose domain is the cartesian
product of $N$ with the set of all finite subsets of $\{T_i \mid i \in N\}$ and whose codomain
is the set of all finite sequences of members of $\{T_i \mid i \in N\}$, satisfy the following.
For all $k, n \in N$ and members $X_0 \ldots X_n$ of $\{T_i \mid i \in N\}$, $O(k, \{X_0 \ldots X_n\})$ orders
$X_0 \ldots X_n$ in such a way that:

1. for all $r, s \leq n$, $X_r$ occurs before $X_s$ in $O(k, \{X_0 \ldots X_n\})$ whenever
$\{i < k \mid X_r \models \beta_i\} \subset \{i < k \mid X_s \models \beta_i\}$;

2. for all $k' \in N$, if $\{i < k \mid X_r \models \beta_i\} \subset \{i < k \mid X_s \models \beta_i\}$ and
$\{i < k' \mid X_r \models \beta_i\} \subset \{i < k' \mid X_s \models \beta_i\}$ are equivalent for all $r, s \leq n$, then
$O(k, \{X_0 \ldots X_n\}) = O(k', \{X_0 \ldots X_n\})$.

Let $k \in N$ and $\alpha_0 \ldots \alpha_k \in \mathcal{L}_{atsen}$ be given. We define $\Psi(\alpha_0 \ldots \alpha_k)$. Suppose there
is $n \in N$, $X_0 \ldots X_n \in \{T_i \mid i \in N\}$, and least $i \leq n$ such that $O(k, \Phi(\alpha_0 \ldots \alpha_k)) =
(X_0 \ldots X_n)$ and $\{\alpha_0 \ldots \alpha_k\} \subseteq \{\alpha \in \mathcal{L}_{atsen} \mid X_i \models \alpha\}$. Then set $\Psi(\alpha_0 \ldots \alpha_k) =
X_i$. Otherwise $\Psi(\alpha_0 \ldots \alpha_k)$ is undefined. This completes the definition of $\Psi$.

Let $S \in \mathcal{C}$ and text $e = (\alpha_k)_{k \in N}$ for $S$ be given. To finish the proof it
suffices to show that there is $i \in N$ such that $\Psi$ satisfies conditions v) and vi)
of Definition 3. By the definitions of $\Xi$ and $\Lambda$, let $K, n_1, n_2 \in N$ be such that
$\Xi(e[k]) = n_1$ and $\Lambda(e[k]) = n_2$ for all $k \geq K$, and let $j \in N$ be such that
$\mathcal{H}_{T_j} = S$, $card(T_j) \leq n_1$, and $card(Var_{loc}(C)) \leq n_2$ for all $C \in T_j$. We define
by induction sequences $(k_0 \ldots k_{n_1})$ and $(i_0 \ldots i_{n_1})$ of integers. Set $k_0 = 0$. Let
$i_0 \in N$ be such that $T_{i_0}$ is a minimal member of $\{T_i \mid i \in N\}$ with $T_{i_0} \models \alpha_0$, and
$T_{i_0} \subseteq T_j$. Let $p < n_1$ be given, and suppose that $k_p$ and $i_p$ have been defined.
We define $k_{p+1}$ and $i_{p+1}$. If $\bigcup_{q \leq p} T_{i_q}$ and $T_j$ are logically equivalent then set
$k_{p+1} = k_p$ and $i_{p+1} = i_p$. Suppose otherwise. Let $k_{p+1} > k_p$ be least such that
$\bigcup_{q \leq p} T_{i_q} \not\models \alpha_{k_{p+1}}$. Let $i_{p+1} \in N$ be such that $T_{i_{p+1}}$ is a minimal member of
$\{T_i \mid i \in N\}$ with $T_{i_{p+1}} \models \alpha_{k_{p+1}}$ and $T_{i_{p+1}} \subseteq T_j$. Trivially, $\bigcup_{p \leq n_1} T_{i_p}$ is a subset
of $T_j$ which is logically equivalent to $T_j$. Fix $K' \geq K, k_{n_1}, i_0 \ldots i_{n_1}$. It is easy
to verify that $\bigcup_{p \leq n_1} T_{i_p}$ belongs to $\Phi(\alpha_0 \ldots \alpha_k)$ for all $k \geq K'$. Moreover, it is
easy to show that there is finite $E \subseteq \{T_i \mid i \in N\}$ such that $\Phi(\alpha_0 \ldots \alpha_k) = E$ for
cofinitely many $k \in N$. With the definition of $O$, we infer that there is $K'' \geq K'$,
$n \in N$, and $X_0 \ldots X_n \in \{T_i \mid i \in N\}$ such that $O(k, \Phi(\alpha_0 \ldots \alpha_k)) = (X_0 \ldots X_n)$
for all $k \geq K''$, and there is least $p \leq n$ such that $\mathcal{H}_{X_p} = S$. By the definition of
$O$ again, $\mathcal{H}_{X_p} - \mathcal{H}_{X_q} \neq \emptyset$ for all $q < p$. Let $K''' \geq K''$ be such that for all $q < p$,
$\{\alpha_0 \ldots \alpha_{K'''}\} - \mathcal{H}_{X_q} \neq \emptyset$. It follows immediately from the definition of $\Psi$ that
for all $k \geq K'''$, $\Psi(\alpha_0 \ldots \alpha_k) = X_p$. So $i \in N$ with $X_p = T_i$ satisfies conditions
v) and vi) of Definition 3, as required. □

An immediate modification of the proof of Proposition 1 shows the following
result, due to Shinohara.

Proposition 2. Let frame $\mathcal{F} = (\mathcal{C}, (T_i)_{i \in N})$ be such that the following holds:

1. bounds for the number of clauses are learnable in $\mathcal{F}$;

2. for every $n \in N$ and $\alpha \in \mathcal{L}_{atsen}$, the set of all minimal $T \in \{T_i \mid i \in N\}$ such
that $T \models \alpha$ and $card(T) \leq n$, is finite.

Then $\mathcal{C}$ is learnable in $\mathcal{F}$.

4 Admissible Programs 

To apply Proposition 1 we have to consider frames which satisfy its third
condition. Towards this aim, it suffices to consider frames built from
programs $T$ such that:

1 . to any atomic sentence a is associated a complexity measure such that only 
finitely many atomic sentences are at most as complex as a; 

2. all sentences that are derived thanks to any clause in T are at least as 
complex as the premisses. 

The following notation is used for the remainder of the paper; it captures the 
abstract notion of complexity measure we need. 

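Definition 4. $\preceq$ denotes a pre-order on $\mathcal{L}_{atsen}$ such that for all $\alpha \in \mathcal{L}_{atsen}$,
$E_\alpha = \{\beta \in \mathcal{L}_{atsen} \mid \beta \preceq \alpha\}$ is finite, and $\{(\alpha, E_\alpha) \mid \alpha \in \mathcal{L}_{atsen}\}$ is computable.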

Here are two natural examples of such pre-orders. 

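Example 1. Given $t \in \mathcal{L}_{cterm}$, denote by $size(t)$ the total number of
occurrences of constants and function symbols occurring in $t$. Let $k, k' < \kappa$ and
$t_1 \ldots t_{\iota_k}, t'_1 \ldots t'_{\iota_{k'}} \in \mathcal{L}_{cterm}$ be given. Set $p_k(t_1 \ldots t_{\iota_k}) \preceq p_{k'}(t'_1 \ldots t'_{\iota_{k'}})$ just in
case $\max_{1 \leq i \leq \iota_k} size(t_i) \leq \max_{1 \leq i \leq \iota_{k'}} size(t'_i)$.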
Example 2. We define inductively the height of a closed term $t$ and denote it by
$height(t)$. All constants have a height of 0. Let $n \geq 1$, $t_1 \ldots t_n \in \mathcal{L}_{cterm}$, and
$k \in N$ be least such that the height of all $t_p$ is at most equal to $k$. Then for all
$n$-ary function symbols $f$, the height of $f(t_1 \ldots t_n)$ is equal to $k + 1$. Let $k, k' < \kappa$
and $t_1 \ldots t_{\iota_k}, t'_1 \ldots t'_{\iota_{k'}} \in \mathcal{L}_{cterm}$ be given. Set $p_k(t_1 \ldots t_{\iota_k}) \preceq p_{k'}(t'_1 \ldots t'_{\iota_{k'}})$ just
in case $\max_{1 \leq i \leq \iota_k} height(t_i) \leq \max_{1 \leq i \leq \iota_{k'}} height(t'_i)$.
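The two pre-orders of Examples 1 and 2 can be sketched in a few lines. The encoding is an assumption made only for this illustration: a ground term is a tuple whose first component is the symbol and whose remaining components are the arguments, an atom is encoded the same way with the predicate first, and the names `size`, `height` and `leq` are hypothetical helpers.

```python
def size(t):
    """Number of occurrences of constants and function symbols in ground term t."""
    return 1 + sum(size(arg) for arg in t[1:])

def height(t):
    """Height of ground term t: 0 for constants, 1 + max height of arguments otherwise."""
    if len(t) == 1:
        return 0
    return 1 + max(height(arg) for arg in t[1:])

def leq(atom1, atom2, measure):
    """atom1 is below atom2 when its largest argument measure is at most that of atom2."""
    m1 = max((measure(t) for t in atom1[1:]), default=0)
    m2 = max((measure(t) for t in atom2[1:]), default=0)
    return m1 <= m2

zero = ("0",)
two = ("s", ("s", zero))        # s(s(0))
three = ("s", two)              # s(s(s(0)))

print(size(two), height(two))                      # 3 2
print(leq(("p", two), ("q", zero, three), size))   # True
print(leq(("q", zero, three), ("p", two), height)) # False
```

For each atom only finitely many ground atoms lie below it when the language has finitely many symbols, which is the finiteness requirement of Definition 4.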

Relying on $\preceq$, we can now define a set of programs that fulfill the former
requirements.

Definition 5. Let Herbrand structure $S$ and clause $C = A_0 \leftarrow A_1 \ldots A_\ell$ be
given. We say that $C$ is admissible with respect to $S$ just in case for every
$h : Var \to \mathcal{L}_{cterm}$, if $A_n[h] \in S$ for all $n \leq \ell$ then $A_n[h] \preceq A_0[h]$ for all $n \leq \ell$.

A program is admissible if all its clauses are admissible with respect to its least
Herbrand model.³

Lemma 1. For every recursive enumeration $(T_i)_{i \in N}$ of admissible programs,
$(\mathcal{H}_{T_i})_{i \in N}$ is an indexed family of recursive Herbrand structures.

Proof. It suffices to show that there exists a partial recursive function $\Phi$ such that
for all admissible programs $T$ and atomic sentences $\alpha$, $\Phi(T, \alpha) = 1$ if $\mathcal{H}_T \models \alpha$
and $\Phi(T, \alpha) = 0$ otherwise. We describe informally how $\Phi$ reacts faced with
program $T$ and atomic sentence $\alpha$. By Definition 4 let $E$ be the finite set of all
$\psi \in \mathcal{L}_{atsen}$ such that $\psi \preceq \alpha$. Let $F_0$ be the set of all closed instances of members
of $T$ which belong to $E$. Let $i \in N$ be given, and suppose that $F_i$ has been
defined. Let $F_{i+1}$ be the union of $F_i$ with the set of all $\psi \in E$ such that for
some clause $C = A_0 \leftarrow A_1 \ldots A_\ell$ in $T$ and sequence $(\psi_1 \ldots \psi_\ell)$ of members of $F_i$,
$\psi \leftarrow \psi_1 \ldots \psi_\ell$ is a closed instance of $C$. Let $i \in N$ be least such that $F_i = F_{i+1}$.
Set $\Phi(T, \alpha) = 1$ if $\alpha \in F_i$, and $\Phi(T, \alpha) = 0$ otherwise. It follows easily from
Definition 5 that if $T$ is admissible, then $\Phi(T, \alpha) = 1$ whenever $\mathcal{H}_T \models \alpha$, and
$\Phi(T, \alpha) = 0$ otherwise. □
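The decision procedure in the proof is a forward-chaining fixpoint restricted to the finite set $E$. A minimal sketch follows, under assumptions that are not in the paper: variables are capitalized strings, ground terms and atoms are tuples `(symbol, arg, ...)`, a clause is a pair `(head, body)`, the caller supplies $E$ explicitly, and `match`, `derivable` and `decide` are hypothetical names.

```python
def match(pattern, ground, subst):
    """Extend subst so that pattern instantiates to ground, or return None."""
    if isinstance(pattern, str):                      # a variable
        if pattern in subst:
            return subst if subst[pattern] == ground else None
        extended = dict(subst)
        extended[pattern] = ground
        return extended
    if pattern[0] != ground[0] or len(pattern) != len(ground):
        return None
    for p, g in zip(pattern[1:], ground[1:]):
        subst = match(p, g, subst)
        if subst is None:
            return None
    return subst

def derivable(clause, psi, known):
    """Is psi the head of a closed instance of clause whose body atoms are all in known?"""
    head, body = clause

    def solve(rest, subst):
        if not rest:
            return True
        for fact in known:
            extended = match(rest[0], fact, subst)
            if extended is not None and solve(rest[1:], extended):
                return True
        return False

    start = match(head, psi, {})
    return start is not None and solve(list(body), start)

def decide(program, alpha, bounded_atoms):
    """Compute F_0, F_1, ... inside bounded_atoms (the set E) until a fixpoint is reached."""
    derived = set()
    while True:
        new = {psi for psi in bounded_atoms
               if psi not in derived and any(derivable(c, psi, derived) for c in program)}
        if not new:
            return alpha in derived
        derived |= new

def num(n):
    """Peano numeral s^n(0)."""
    t = ("0",)
    for _ in range(n):
        t = ("s", t)
    return t

# Toy admissible program: even(0) <- ; even(s(s(X))) <- even(X).
program = [(("even", ("0",)), ()), (("even", ("s", ("s", "X"))), (("even", "X"),))]
E = {("even", num(n)) for n in range(5)}   # atoms at most as complex as even(s^4(0))
print(decide(program, ("even", num(4)), E))  # True
print(decide(program, ("even", num(3)), E))  # False
```

The loop terminates because each round adds at least one atom of the finite set `bounded_atoms`; correctness for a given program relies on admissibility, as in the proof.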

Proposition 3. Let frame $\mathcal{F} = (\mathcal{C}, (T_i)_{i \in N})$ be such that for all $i \in N$, $T_i$ is an
admissible program. Then the following conditions are equivalent.

(a) Bounds for the number of clauses and bounds for the number of local
variables are learnable in $\mathcal{F}$.

(b) $\mathcal{C}$ is learnable in $\mathcal{F}$.

Proof. Trivially, (b) implies (a). Suppose that (a) holds. Let $\alpha \in \mathcal{L}_{atsen}$ and
$n \in N$ be given. Let $E$ be the set of admissible programs $T$ such that: (1)
$T \models \alpha$, (2) for all $T' \subset T$, $T' \not\models \alpha$, and (3) all members of $T$ contain no more
than $n$ local variables.

By Proposition 1 it suffices to show that $E$ is finite. Let $T \in E$ be given.
By (1) let $i \in N$ and sequence $(\alpha_0 \ldots \alpha_i)$ of atomic sentences be a proof of $\alpha$
from $T$, i.e., $\alpha_i = \alpha$ and for all $j \leq i$, there is a clause $C = A_0 \leftarrow A_1 \ldots A_\ell$ in
$T$, $j_1 \ldots j_\ell < j$, and $h : Var \to \mathcal{L}_{cterm}$ with $A_0[h] = \alpha_j$ and $A_n[h] = \alpha_{j_n}$ for all
$1 \leq n \leq \ell$. Without loss of generality, we can suppose that $(\alpha_0 \ldots \alpha_i)$ is minimal,
i.e., there is no $i' \in N$ and sequence $(\alpha'_0 \ldots \alpha'_{i'})$ of atomic sentences which is a
proof of $\alpha$ from $T$ with $\{\alpha'_j \mid j \leq i'\} \subset \{\alpha_j \mid j \leq i\}$. Since $T$ is admissible, $\alpha_j \preceq \alpha_i$
for all $j \leq i$. Hence $\{\alpha_j \mid j \leq i\}$ is a subset of the finite set $Y_1$ of all $\beta \in \mathcal{L}_{atsen}$
such that $\beta \preceq \alpha$. Let $Y_2$ be the set of all atomic formulas $\chi$ such that for some
$\psi \in Y_1$, $\psi$ is an instance of $\chi$, and let $Z$ be the set of clauses built from $Y_2$. Note
that for every $t \in \mathcal{L}_{cterm}$ there are, up to a renaming of variables, only finitely
many $t' \in \mathcal{L}_{term}$ such that $t$ is an instance of $t'$. Since $Y_1$ is finite, this implies
that $Z$ is finite (recall the convention on the variables appearing in a clause).
By (2), for every atomic formula $A$ occurring in $T$, there is $j \leq i$ such that $\alpha_j$
is an instance of $A$. Hence every atomic formula that occurs in $T$ belongs to $Y_2$.
This, (3), and the fact that $Z$ is finite imply immediately that $E$ is finite, as
required. □

³ It should be noted that the class of admissible programs is more general than the
class of acyclic programs.

Corollary 1. Let $\mu, \nu \in N$ and frame $\mathcal{F} = (\mathcal{C}, (T_i)_{i \in N})$ be such that for all
$i \in N$, $T_i$ is an admissible program which contains at most $\mu$ clauses, each of
them containing at most $\nu$ distinct local variables. Then $\mathcal{C}$ is learnable in $\mathcal{F}$.



5 Strongly Admissible Programs 

The notion of admissible program has a semantic character since it relies on
least Herbrand models. Unfortunately, the class of all admissible programs cannot be cast
in a frame. We will now consider two kinds of frames built from programs whose
admissibility is established by a syntactic examination. The first kind of
frame is limited to programs without local variables (provided that $\mathcal{L}_{cterm}$ is
infinite, which is the interesting case). This is achieved technically as follows.

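Definition 6. Let binary relation $R$ on $\mathcal{L}_{at}$ be defined as follows: for all $\psi, \varphi \in
\mathcal{L}_{at}$, $\psi R \varphi$ just in case for all $h : Var \to \mathcal{L}_{cterm}$, $\psi[h] \preceq \varphi[h]$. It is easy to
verify that $R$ is a pre-order on $\mathcal{L}_{at}$ which extends $\preceq$. It will also be denoted $\preceq$.
Moreover, we now suppose that $\{(\psi, \varphi) \mid \psi \preceq \varphi,\ \psi, \varphi \in \mathcal{L}_{at}\}$ is computable.

Definition 7. A program $T$ is strongly admissible just in case for every clause
$A_0 \leftarrow A_1 \ldots A_\ell$ in $T$ and $n \leq \ell$, $A_n \preceq A_0$.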

Example 3. Examples 1 and 2 define pre-orders on $\mathcal{L}_{cterm}$ whose extensions to
$\mathcal{L}_{term}$ satisfy the condition expressed in Definition 6. Hence all reductive
programs are strongly admissible (for the notion of reductive program, see [13]).

In case $\mathcal{L}_{cterm}$ is infinite, it is easy to verify that for all $\psi, \varphi \in \mathcal{L}_{at}$ with $\psi \preceq \varphi$,
every variable that occurs in $\psi$ also occurs in $\varphi$. Hence:

Lemma 2. Suppose that $\mathcal{L}_{cterm}$ is infinite. Then all strongly admissible
programs have no local variables.

Directly from the preceding definitions we have Lemmas 3 and 4.

Lemma 3. The set of strongly admissible programs is computable.

Lemma 4. Every strongly admissible program is admissible.

Directly from Lemma 1, Proposition 3 and the lemmas above:

Proposition 4. Let $(T_i)_{i \in N}$ be a computable enumeration of all strongly
admissible programs. Then the following conditions are equivalent.

1. Bounds for the number of clauses are learnable in $(\{\mathcal{H}_{T_i} \mid i \in N\}, (T_i)_{i \in N})$.

2. $\{\mathcal{H}_{T_i} \mid i \in N\}$ is learnable in $(\{\mathcal{H}_{T_i} \mid i \in N\}, (T_i)_{i \in N})$.

6 Safe Programs 

Now we investigate the interesting case of programs with local variables. The
key idea is to "bound" all atomic formulas in which local variables occur with
atomic formulas without local variables, in such a way that the resulting program
is strongly admissible. To this aim, we generalise the partial pre-order $\preceq$ to a
partial pre-order on the set of $\mathcal{L}$-terms.

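Definition 8. $\preceq$ now also denotes a pre-order on $\mathcal{L}_{term}$ with these properties:

1. For all $t, t' \in \mathcal{L}_{term}$, $t \preceq t'$ iff $t[h] \preceq t'[h]$ for all $h : Var \to \mathcal{L}_{cterm}$.

2. Given $k, k' < \kappa$ and $t_1 \ldots t_{\iota_k}, t'_1 \ldots t'_{\iota_{k'}} \in \mathcal{L}_{term}$, $p_k(t_1 \ldots t_{\iota_k}) \preceq p_{k'}(t'_1 \ldots t'_{\iota_{k'}})$
just in case for all $1 \leq i \leq \iota_k$, there is $1 \leq j \leq \iota_{k'}$ with $t_i \preceq t'_j$.

3. Let $X$ be the set of all $(t_0 \ldots t_n, t'_0 \ldots t'_n) \in \mathcal{L}_{term}^{2(n+1)}$, $n \in N$, that satisfy
the following. For all $h : Var \to \mathcal{L}_{term}$, if $t_m[h] \preceq t'_m[h]$ for all $m < n$ then
$t_n[h] \preceq t'_n[h]$. Then $X$ is computable.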

Example 4. Consider the pre-order $\preceq$ on $\mathcal{L}_{at}$ defined in Example 1 (respect.
Example 2). Given $t, t' \in \mathcal{L}_{cterm}$, set $t \preceq t'$ just in case $size(t) \leq size(t')$ (respect.
$height(t) \leq height(t')$). The conditions expressed in Definition 8 are satisfied.

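Definition 9. Let program $T$ be given. A subset $X$ of $\bigcup_{k < \kappa} (\mathcal{L}_{at}[k] \times \{1 \ldots \iota_k\}^2)$
is a bound kit for $T$ just in case for all clauses $A_0 \leftarrow A_1 \ldots A_\ell$ in $T$ and $h :
Var \to \mathcal{L}_{cterm}$, (*) below implies (†) below.

(*) For all $1 \leq j \leq \ell$ and $(\varphi, p, q) \in X$ with $\varphi \models A_j$, $A_j(p)[h] \preceq A_j(q)[h]$.

(†) For all $(\varphi, p, q) \in X$ such that $\varphi[h'] = A_0$ for some $h' : Var \to \mathcal{L}_{term}$,
$A_0(p)[h] \preceq A_0(q)[h]$.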

It is easy to verify the following, by induction on the length of proofs. 

Lemma 5. Let program $T$, $\alpha \in \mathcal{L}_{atsen}$ with $T \models \alpha$, and bound kit $X$ for $T$ be
given. Then for all $(\varphi, p, q) \in X$ with $\varphi \models \alpha$, $\alpha(p) \preceq \alpha(q)$.

Definition 10. A program $T$ is safe just in case for all clauses $A_0 \leftarrow A_1 \ldots A_\ell$
in $T$, $1 \leq j \leq \ell$, and $p \geq 1$ no greater than the arity of $A_j$, there exists $q \geq 1$
no greater than the arity of $A_0$, $s \in N$, sequence $((\varphi_r, p_r, q_r))_{r \leq s}$ of members of
a bound kit for $T$, and $1 \leq k_0 \ldots k_s \leq \ell$ such that:

1. for all $r \leq s$, $\varphi_r \models A_{k_r}$;

2. $A_j(p) \preceq A_{k_0}(p_0)$;

3. for all $r < s$, $A_{k_r}(q_r) \preceq A_{k_{r+1}}(p_{r+1})$;

4. $A_{k_s}(q_s) \preceq A_0(q)$.
Directly from the preceding definitions:

Lemma 6. Every strongly admissible program is safe.

The third condition in Definition 8 implies that the set of pairs $(T, X)$, where $X$
is a bound kit for $T$, is computable. With Definition 10 we conclude that:

Lemma 7. The set of safe programs is computable. 

Lemma 8. Every safe program is admissible. 

Proof. Let safe program $T$ be given. Let clause $A_0 \leftarrow A_1 \ldots A_\ell$ in $T$ and $h :
Var \to \mathcal{L}_{cterm}$ be such that for all $n \leq \ell$, $T \models A_n[h]$. Let $1 \leq j \leq \ell$ be given. Let
$p \geq 1$ be no greater than the arity of $A_j$. Then there exists $q \geq 1$ no greater than
the arity of $A_0$, $s \in N$, sequence $((\varphi_r, p_r, q_r))_{r \leq s}$ of members of a bound kit for
$T$, and $1 \leq k_0 \ldots k_s \leq \ell$ which satisfy the four conditions expressed in Definition
10. Hence $\varphi_r \models A_{k_r}[h]$ for all $r \leq s$, which with Lemma 5 and the fact that
$T \models A_n[h]$ for all $1 \leq n \leq \ell$, implies that $A_{k_r}(p_r)[h] \preceq A_{k_r}(q_r)[h]$ for all $r \leq s$.
Moreover, the first condition in Definition 8 implies that $A_j(p)[h] \preceq A_{k_0}(p_0)[h]$,
$A_{k_r}(q_r)[h] \preceq A_{k_{r+1}}(p_{r+1})[h]$ for all $r < s$, and $A_{k_s}(q_s)[h] \preceq A_0(q)[h]$. It follows
that $A_j(p)[h] \preceq A_0(q)[h]$. By the second condition in Definition 8 we infer that
$A_j[h] \preceq A_0[h]$. Hence $T$ is admissible. □

Directly from Lemma 1, Proposition 3 and Lemmas 7 and 8:

Proposition 5. Let $(T_i)_{i \in N}$ be a computable enumeration of all safe programs.
Then the following conditions are equivalent.

1. Bounds for the number of clauses and bounds for the number of local variables
are learnable in $(\{\mathcal{H}_{T_i} \mid i \in N\}, (T_i)_{i \in N})$.

2. $\{\mathcal{H}_{T_i} \mid i \in N\}$ is learnable in $(\{\mathcal{H}_{T_i} \mid i \in N\}, (T_i)_{i \in N})$.

7 Examples 

We now give a few examples of safe programs, among which linearly-moded
programs in the sense of Krishna Rao. They are safe on the basis of a wide class of
pre-orders $\preceq$, among which those given as examples.

Example 5. The reverse program:

app([], Ys, Ys) ←
app([X|Xs], Ys, [X|Zs]) ← app(Xs, Ys, Zs)
rev([], []) ←
rev([X|Xs], Zs) ← rev(Xs, Ys), app(Ys, [X], Zs)

{(app(X,Y,Z), 1, 3)} is a bound-kit for the reverse program, because for all
$h : Var \to \mathcal{L}_{cterm}$, [][h] $\preceq$ Ys[h] (see first clause) and Xs[h] $\preceq$ Zs[h] implies
[X|Xs][h] $\preceq$ [X|Zs][h] (see second clause). So rev(Xs,Ys) and app(Ys,[X],Zs)
in the fourth clause can be "bounded" by rev(Xs,Zs) and app(Zs,[X],Zs)
respectively. We conclude easily that the reverse program is safe.
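The bound-kit claim for app can be spot-checked on small ground instances. The sketch below is only an illustration under stated assumptions: lists are encoded as nested tuples with hypothetical symbols `cons` and `nil`, `size` is the measure of Example 1, and `app` computes the third argument of the app predicate functionally rather than by resolution.

```python
from itertools import product

NIL = ("nil",)

def cons_list(items):
    """Encode a Python sequence of constant names as a ground list term."""
    term = NIL
    for x in reversed(items):
        term = ("cons", (x,), term)
    return term

def size(t):
    """Number of occurrences of constants and function symbols in ground term t."""
    return 1 + sum(size(arg) for arg in t[1:])

def app(xs, ys):
    """Third argument Zs of the unique ground fact app(xs, ys, Zs)."""
    if xs == NIL:
        return ys
    return ("cons", xs[1], app(xs[2], ys))

# Check that argument 1 is bounded by argument 3 on all lists over {a, b} of length <= 3.
small_lists = [cons_list(w) for n in range(4) for w in product("ab", repeat=n)]
bounded = all(size(xs) <= size(app(xs, ys)) for xs in small_lists for ys in small_lists)
print(bounded)  # True
```

A finite check of this kind does not prove the bound-kit property; it only illustrates the inequality that Lemma 5 guarantees for every consequence of the program.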
Example 6. The quick-sort program:

ap([], Ys, Ys) ←
ap([X|Xs], Ys, [X|Zs]) ← ap(Xs, Ys, Zs)
par([], H, [], []) ←
par([X|Xs], H, [X|Ls], Bs) ← X ≤ H, par(Xs, H, Ls, Bs)
par([X|Xs], H, Ls, [X|Bs]) ← X > H, par(Xs, H, Ls, Bs)
qs([], []) ←
qs([H|L], S) ← par(L, H, A, B), qs(A, A1), qs(B, B1), ap(A1, [H|B1], S)

It is easily verified that the set consisting of (ap(X,Y,Z), 1, 3), (ap(X,Y,Z), 2, 3),
(par(X,H,L,B), 3, 1) and (par(X,H,L,B), 4, 1) is a bound-kit for the quick-sort
program. Hence par(L,H,A,B), qs(A,A1), qs(B,B1) and ap(A1,[H|B1],S) in
the definition of qs can be "bounded" by par(L,H,L,L), qs(L,S), qs(L,S) and
ap(S,S,S) respectively. We conclude immediately that quick-sort is a safe
program.

Example 7. The exponential program:

adminus1(s(X), 0, X) ←
adminus1(s(X), s(Y), s(Z)) ← adminus1(s(X), Y, Z)
mult(0, Y, 0) ←
mult(s(X), 0, 0) ←
mult(s(X), s(Y), s(Z)) ← mult(s(X), Y, U), adminus1(s(X), U, Z)
exp(X, 0, 1) ←
exp(0, s(Y), 0) ←
exp(s(X), s(Y), Z) ← exp(s(X), Y, U), mult(s(X), U, Z)

It is easy to verify that {(adminus1(X,Y,Z), 2, 3), (mult(s(X),Y,Z), 2, 3)} is a
bound-kit for the exponential program. Therefore the atoms mult(s(X),Y,U)
and adminus1(s(X),U,Z) (respect. exp(s(X),Y,U) and mult(s(X),U,Z)) in
the definition of mult (respect. exp) can be "bounded" by mult(s(X),Y,Z) and
adminus1(s(X),Z,Z) (respect. by exp(s(X),Y,Z) and mult(s(X),Z,Z)). We
conclude immediately that exponential is a safe program.
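To see what the clauses compute and why the second argument is bounded by the third, the program can be transcribed into recursive functions on natural numbers. This is a sketch under assumptions: Peano numerals are replaced by Python integers (so the size order on numerals coincides with the usual order), and each function returns the last argument of the corresponding predicate, following the clauses one by one.

```python
def adminus1(x, y):
    """z with adminus1(x, y, z); defined for x >= 1, and z = x + y - 1."""
    if y == 0:
        return x - 1                    # adminus1(s(X), 0, X)
    return 1 + adminus1(x, y - 1)       # adminus1(s(X), s(Y), s(Z)) <- adminus1(s(X), Y, Z)

def mult(x, y):
    """z with mult(x, y, z)."""
    if x == 0 or y == 0:
        return 0                        # mult(0, Y, 0) and mult(s(X), 0, 0)
    u = mult(x, y - 1)                  # mult(s(X), Y, U)
    return 1 + adminus1(x, u)           # adminus1(s(X), U, Z), head has s(Z)

def exp(x, y):
    """z with exp(x, y, z)."""
    if y == 0:
        return 1                        # exp(X, 0, 1)
    if x == 0:
        return 0                        # exp(0, s(Y), 0)
    return mult(x, exp(x, y - 1))       # exp(s(X), Y, U), mult(s(X), U, Z)

print(mult(2, 3), exp(2, 3))            # 6 8

# Bound kit: argument 2 is bounded by argument 3 for adminus1 and for mult(s(X), Y, Z).
kit_holds = all(y <= adminus1(x, y) and y <= mult(x, y)
                for x in range(1, 5) for y in range(0, 5))
print(kit_holds)                        # True
```

The check covers only small arguments; it illustrates, rather than proves, the two triples of the bound-kit.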

8 Conclusion 

The results in this paper give a generalized framework for proving convergence 
results for learnability of logic programs from positive data. Clearly, the next
step is to explore more concrete models by introducing resource-boundedness and
probabilistic data settings in the above model. These issues are topics for our 
future work. We would also like to note that the results presented here can be 
extended to include ordinal bounds on the number of mind changes in the spirit 
of IIUI . This issue will be included in the journal version of the paper. 
Abstract. In this paper the problem of induction of clausal theories
through a search space consisting of theories is studied. We order the
search space by an extension of θ-subsumption for theories, and find a
least generalization and a greatest specialization of theories. A most
specific theory is introduced, and we develop a refinement operator bounded
by this theory.



1 Introduction 

In the normal setting of ILP, the goal is to find a hypothesis theory which explains 
a set of examples given some background theory. This is done iteratively such
that one single clause is found at a time. Each clause is usually found by
searching some ordering of clauses. When a clause is found, it is added to the 
hypothesis theory, and the examples it explains are removed from the example 
set. 

This approach poses a problem since the clauses may depend on each other. 
So the sequence in which they are found may be critical. A strategy such as 
selecting the clause that explains the most examples or seems the best given 
some measure, is often used. But this does not mean that the best theory is 
found. This is really a greedy strategy and the search is by no means complete. 

One way to overcome this problem would be to use a search space consisting
of theories rather than clauses. The best theory will obviously be in this search 
space. A drawback is of course that this search space is much larger, but it has 
the benefit that we can control the search more directly. So even if we cannot 
search it completely, it might be easier to define good heuristics for this search 
space. 

In this paper we study this search space of theories. Section 2 recalls some
preliminaries. Section 3 mentions some related work. In section 4, we find a least
generalization and a greatest specialization and show that this search space has
no ideal refinement operator. We introduce a most specific theory which can be
used to bound the search space in section 5. Finally, in section 6 we construct a
refinement operator for this search space which is bounded by this most specific
theory.

S. Dzeroski and P. Flach (Eds.): ILP-99, LNAI 1634, pp. 210-221, 1999. 
2 Preliminaries

First-order logic. We assume that the reader is familiar with basic logic
programming concepts and recall just a few that are necessary for this paper.

As in [7], a clausal language $\mathcal{C}$ given by an alphabet is the set of all clauses.
A Horn language $\mathcal{H}$ given by an alphabet is the set of all Horn clauses. A (Horn)
theory is a finite set of (Horn) clauses. A clause or a theory is function-free if it
does not contain functions with arity $\geq 1$. Two clauses are standardized apart if
they have no variables in common. Let $T$ be a theory. Then $HB(T)$ denotes the
Herbrand base of $T$, and $M(T)$ denotes the least Herbrand model of $T$. $HB_P(T)$
is the subset of $HB(T)$ containing only the ground instances of the predicate
$P$. If $e$ is a clause and $B$ a theory then $e^+$ and $e^-$ designate the head and the
body of $e$. $\bar{e}$ denotes $(\neg e^+ \wedge e^-)\sigma$ where $\sigma$ is a Skolem substitution wrt. $B$. Let
$E$ be a theory where $E = \{e_1, \ldots, e_m\}$ and each $e_i$ is a clause. Then $\bar{E}$ denotes
$\bar{e}_1 \vee \cdots \vee \bar{e}_m = (\neg e_1^+ \wedge e_1^-)\sigma_1 \vee \cdots \vee (\neg e_m^+ \wedge e_m^-)\sigma_m$ such that for all $1 \leq i \leq m$,
$\sigma_i$ is a Skolem substitution wrt. $B$ and $\bar{e}_j$ for all $j \neq i$.

Quasi-orders. A quasi-ordered set $(G, R)$ is a set $G$ ordered by a relation $R$
where $R$ is reflexive ($\forall x \in G$, $xRx$) and transitive ($\forall x, y, z \in G$, $xRy$ and $yRz \Rightarrow
xRz$). Let $(G, \geq)$ be a quasi-ordered set and let $x, y \in G$. If $x > y$ and there is no
$z$ such that $x > z > y$, then $x$ is an upward cover of $y$, and $y$ is a downward cover
of $x$. θ-subsumption is a quasi-order that is usually defined between clauses, but
has also been defined between theories [1, 3]:

Definition 1. (θ-subsumption) A clause $c_1$ θ-subsumes a clause $c_2$ ($c_1 \succeq c_2$)
iff there is a substitution θ such that $c_1\theta \subseteq c_2$. Two clauses $c_1$ and $c_2$ are
θ-equivalent ($c_1 \sim c_2$) iff $c_1 \succeq c_2$
and $c_2 \succeq c_1$. A clause $c_1$ is θ-reduced iff there is no proper subset $c_2$ of $c_1$
($c_2 \subset c_1$) such that $c_1 \sim c_2$. A theory $T_1$ θ-subsumes a theory $T_2$ ($T_1 \succeq T_2$) iff
$\forall c_2 \in T_2\ \exists c_1 \in T_1 : c_1 \succeq c_2$. Two theories $T_1$ and $T_2$ are θ-equivalent ($T_1 \sim T_2$) iff
$T_1 \succeq T_2$ and $T_2 \succeq T_1$. A theory $T_1$ is θ-reduced iff all the clauses in the theory
are θ-reduced and there is no proper subset $T_2$ of $T_1$ such that $T_1 \sim T_2$.
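For function-free clauses, θ-subsumption between clauses and between theories can be tested by a small backtracking search. The sketch below makes several assumptions for illustration only: a literal is a triple `(sign, predicate, args)`, variables are capitalized strings, the two clauses are standardized apart, and `subsumes` and `theory_subsumes` are hypothetical names. Variables of the subsumed clause are treated as fixed terms.

```python
def is_var(term):
    """Variables are represented by capitalized strings in this sketch."""
    return term[0].isupper()

def subsumes(c1, c2):
    """True iff some substitution theta maps every literal of c1 onto a literal of c2."""
    literals = list(c1)

    def extend(i, theta):
        if i == len(literals):
            return True
        sign, pred, args = literals[i]
        for sign2, pred2, args2 in c2:
            if (sign2, pred2) != (sign, pred) or len(args2) != len(args):
                continue
            candidate = dict(theta)
            ok = True
            for a, b in zip(args, args2):
                if is_var(a):
                    if candidate.setdefault(a, b) != b:   # conflicting binding
                        ok = False
                        break
                elif a != b:                              # constants must coincide
                    ok = False
                    break
            if ok and extend(i + 1, candidate):
                return True
        return False

    return extend(0, {})

def theory_subsumes(t1, t2):
    """T1 subsumes T2 iff every clause of T2 is subsumed by some clause of T1."""
    return all(any(subsumes(c1, c2) for c1 in t1) for c2 in t2)

c = [("+", "P", ("X1", "X2")), ("+", "P", ("X2", "X1"))]
d = [("+", "P", ("Y", "Y"))]
print(subsumes(c, d), subsumes(d, c))        # True False
print(theory_subsumes([c], [d, c]))          # True
```

The search is exponential in the worst case, which is expected since deciding θ-subsumption is NP-complete in general.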

Partial order. A partially ordered set $(G, R)$ is a set $G$ ordered by a relation
$R$ where $R$ is reflexive, anti-symmetric ($\forall x, y \in G$, $xRy$ and $yRx \Rightarrow x = y$), and
transitive. A quasi-order $(G, \geq)$ induces an equivalence relation $\approx$ such
that $x \approx y$ iff $x \geq y$ and $y \geq x$. Let $[x]$ denote the equivalence class of $x$, i.e.,
$[x] = \{y \mid x \approx y\}$. Then we can define a relation $\geq$ on the quotient set $G/\!\approx$
such that $[x] \geq [y]$ iff $x \geq y$. Then $(G/\!\approx, \geq)$ is a partial order. θ-equivalence is
such an equivalence relation, and θ-subsumption becomes a partial order when
it is defined on the equivalence classes induced by $\sim$.

Lattice. Let $(G, \geq)$ be a quasi-ordered set and $S \subseteq G$. Then $x \in G$ is a generalization
(upper bound) of $S$ if $x \geq y$ for all $y \in S$, and a least generalization (least upper
bound) if $z \geq x$ for all generalizations $z \in G$ of $S$. Similarly, $x \in G$ is a
specialization (lower bound) of $S$ if $y \geq x$ for all $y \in S$, and a greatest specialization
(greatest lower bound) if $x \geq z$ for all specializations $z \in G$ of $S$. $(G, \geq)$ is
called a lattice if for all $x, y \in G$ a least generalization and a greatest
specialization of $\{x, y\}$ exist.

A lattice is usually defined on a partial order, but as in [7], we define it on a
quasi-order. This does not really matter since if $(G, \geq)$ is a lattice defined on a
quasi-order, $(G/\!\approx, \geq)$ will be a lattice defined on a partial order.

Refinement operators. Let $(G, \geq)$ be a quasi-ordered set. A downward
refinement operator for $(G, \geq)$ is a function $\rho$ such that $\rho(C) \subseteq \{D \mid C \geq D\}$, and an
upward refinement operator for $(G, \geq)$ is a function $\delta$ such that $\delta(C) \subseteq \{D \mid D \geq C\}$, for
all $C \in G$.

A refinement operator may be characterized by a number of properties. These
are given for a downward operator $\rho$ below, but can easily be changed to fit an
upward refinement operator.

Definition 2. Let $(G, \geq)$ be a quasi-ordered set.

1. The refinement closure $\rho^*(C)$ for some $C \in G$ is:

$\rho^0(C) = \{C\}$

$\rho^n(C) = \{D \mid$ there is an $E \in \rho^{n-1}(C)$ such that $D \in \rho(E)\}$, $n \geq 1$

$\rho^*(C) = \rho^0(C) \cup \rho^1(C) \cup \cdots$

2. A $\rho$-chain from $C$ to $D$ is a sequence $C = C_0, C_1, \ldots, C_n = D$, such that
$C_i \in \rho(C_{i-1})$ for every $1 \leq i \leq n$.

3. $\rho$ is locally finite iff for every $C \in G$, $\rho(C)$ is finite and computable.

4. $\rho$ is proper iff for every $C \in G$, $\rho(C) \subseteq \{D \mid C > D\}$.

5. $\rho$ is complete iff for every $C, D \in G$ such that $C > D$, there is an $E \in \rho^*(C)$
such that $D$ and $E$ are equivalent in the $\geq$-order.

6. $\rho$ is ideal iff it is locally finite, complete, and proper.
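The refinement closure of item 1 can be computed level by level when the operator is locally finite. The following sketch is a toy illustration, not an operator from this paper: the ordered set is assumed to be the subsets of a three-letter alphabet, where a set is more general than its supersets, and the hypothetical operator `add_one_literal` adds a single literal.

```python
def refinement_closure(rho, c, max_steps):
    """Union of rho^0(c), ..., rho^max_steps(c), computed breadth-first."""
    closure = {c}
    frontier = {c}
    for _ in range(max_steps):
        frontier = {d for e in frontier for d in rho(e)} - closure
        if not frontier:          # nothing new: rho^*(c) has been reached
            break
        closure |= frontier
    return closure

ALPHABET = frozenset("abc")

def add_one_literal(clause):
    """Toy downward refinement operator: all proper one-literal extensions of clause."""
    return {clause | {lit} for lit in ALPHABET - clause}

top = frozenset()
closure = refinement_closure(add_one_literal, top, max_steps=10)
print(len(closure))   # 8
```

In this toy order the operator is locally finite, proper (every refinement is a strict superset) and complete, since every subset of the alphabet is reached from the empty clause.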

If a refinement operator is applied on a lattice we may define completeness with
respect to a bottom element.

Definition 3. Let $(G, \geq)$ be a lattice with a top element $\top$ and a bottom
element $\bot$. $\rho$ is complete wrt. a bottom element $\bot$ iff for every $C, D \in G$ such
that $C > D > \bot$ there is an $E \in \rho^*(C)$ such that $D$ and $E$ are equivalent in the
$\geq$-order.

3 Related work 

Plotkin [8] and Reynolds [9] both proved that the language of literals forms
a lattice ordered by either θ-subsumption or entailment. Plotkin also proved
that a clausal language $\mathcal{C}$ ordered by θ-subsumption forms a lattice and gave
an algorithm to compute least generalizations of clauses. He showed that this
lattice contains infinite ascending and descending chains. Some examples of
such chains (taken from [7]) are:

An infinite ascending chain: $c \succ \ldots \succ e_{n+1} \succ e_n \succ \cdots \succ e_2 \succ e_1$, where
$c = \{P(x_1, x_2), P(x_2, x_1)\}$, $d_n = \{P(y_1, y_2), P(y_2, y_3), \ldots, P(y_{n-1}, y_n), P(y_n, y_1)\}$, $n \geq 2$,
$c_n = c \cup d_n$, $n \geq 3$, and $e_k = c_{3^k}$, $k \geq 1$.

An infinite descending chain: $c_2 \succ c_3 \succ \ldots \succ c_n \succ c_{n+1} \succ \ldots \succ c$, where
$c = \{P(x_1, x_1)\}$ and $c_n = \{P(x_i, x_j) \mid i \neq j,\ 1 \leq i, j \leq n\}$, $n \geq 2$.

Laag and Nienhuys-Cheng [4] used such chains to prove that there is no ideal
refinement operator for clauses. For any quasi-order the following lemma holds:

Lemma 1. [7] Let $(G, \geq)$ be a quasi-ordered set. If there exists an ideal downward
(upward) refinement operator for $(G, \geq)$, then every $C \in G$ has a finite
complete set of downward (upward) covers.

Because of the chains, there is a clause that has no finite complete set of
downward covers, and there is a clause that has no upward cover at all.

Proposition 1. [7] Let $\mathcal{C}$ be a clausal language containing a binary predicate
symbol $P$. Then $c = \{P(x_1, x_2), P(x_2, x_1)\}$ has no finite complete set of
downward covers in $(\mathcal{C}, \succeq)$.

Proposition 2. [7] Let $\mathcal{C}$ be a clausal language containing a binary predicate
symbol $P$. Then $c = \{P(x_1, x_1)\}$ has no upward cover in $(\mathcal{C}, \succeq)$.

Thus there is no ideal downward or upward refinement operator for $(\mathcal{C}, \succeq)$.
Similar results also hold for Horn clauses. Nevertheless, there are locally finite and
complete refinement operators which are not proper. One of those will be
extended in section 6.

In [5,6], Muggleton introduced the concept of a most specific clause, which
is also called a bottom clause. This clause is induced from an example $e$ relative
to a background theory $B$, and is defined as follows.

Definition 4. [6] Let $B$ be a Horn theory, $e$ a clause such that $F = (B \cup \overline{e})$ is
satisfiable (i.e. $B \not\models e$). The bottom clause of $e$ under $B$ is denoted $BOT(B, e)$
and is defined as follows.

$$BOT^+(B, e) = \{a \mid a \in HB(F) \setminus M(F)\}, \quad BOT^-(B, e) = \{\neg a \mid a \in M(F)\}$$
$$BOT(B, e) = BOT^+(B, e) \cup BOT^-(B, e)$$

The bottom clause is infinite if $B$ and $e$ contain functions, but finite if they are
function-free. A clause containing functions can be made function-free by
flattening. In Progol, the bottom clause $BOT(B, e)$ is used for bounding the search
space so only the sub-lattice, consisting of the clauses $\theta$-subsuming $BOT(B, e)$,
is searched.

4 Properties of $\theta$-subsumption on theories

Lattice. The least generalization of clauses is important since it can be used for
generalizing clauses. In the next lemma we find a least generalization of theories.

Lemma 2. Let $S$ be $\{T_1, \ldots, T_n\}$ where each $T_i$ is a theory (i.e., $T_i$ is a finite
subset of $\mathcal{C}$). Then $T = \bigcup_{i=1}^{n} T_i$ is a least generalization (least upper bound) of $S$.

Proof.

1. $T$ is a generalization of $S$, i.e., for all $T_i$ in $S$ we have $T \succeq T_i$. This holds
since any clause $c_i$ in $T_i$ is also in $T$.

2. $T$ is a least generalization of $S$, i.e., for all $U$, if $U \succeq T_i$ holds for all $T_i$ then
$U \succeq T$. Assume that $U$ is an arbitrary theory such that $U \succeq T_i$ for all $T_i$.
For each $c \in T$ there is a $T_i$ and a $c_i \in T_i$ such that $c = c_i$ (since $T$ is the
union of the $T_i$). For this $c_i$ there is a $d \in U$ (since $U \succeq T_i$) such that $d \succeq c_i$.
So $d \succeq c$. Thus for each $c \in T$ there must be at least one $d \in U$ such that
$d \succeq c$, and therefore $U \succeq T$. Since $U$ is an arbitrary theory this must hold
generally as well. Thus for all $U$ for which $U \succeq T_i$ holds for all $T_i$, we have
$U \succeq T$.

So $T = \bigcup_{i=1}^{n} T_i$ is a least generalization of $S = \{T_1, \ldots, T_n\}$. $S$ may actually
have several least generalizations, but all of them belong to the same equivalence
class induced by $\theta$-equivalence.

Example 1. Let $T_1 = \{P(x_1, x_2) \vee P(x_2, x_1), Q(x, x)\}$ and
$T_2 = \{P(x_1, a) \vee P(a, x_1), Q(a, a)\}$. Then
$T = T_1 \cup T_2 = \{P(x_1, x_2) \vee P(x_2, x_1), P(x_1, a) \vee P(a, x_1), Q(x, x), Q(a, a)\}$
is a least generalization of $T_1$ and $T_2$. Since $T_1 \succeq T_2$, $T_1$ is also
a least generalization of $T_1$ and $T_2$. But $T_1$ and $T$ belong to the same equivalence
class since $T_1 \sim T$.
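
Example 1 can be checked mechanically. The following Python sketch is a minimal illustration, assuming a function-free language; the tuple encoding of literals, the convention that names starting with x, y or z are variables, and the helper names (`subsumes`, `theory_subsumes`, `lit`) are assumptions made for this example rather than anything defined in the paper.

```python
def is_var(term):
    """Assumed convention for this sketch: names starting with x, y or z are variables."""
    return term[0] in "xyz"


def match_literal(literal, target, theta):
    """Extend theta so that literal·theta equals target, or return None."""
    sign, pred, args = literal
    t_sign, t_pred, t_args = target
    if (sign, pred, len(args)) != (t_sign, t_pred, len(t_args)):
        return None
    theta = dict(theta)
    for a, t in zip(args, t_args):
        if is_var(a):
            if theta.setdefault(a, t) != t:   # variable already bound to another term
                return None
        elif a != t:                          # constants must match exactly
            return None
    return theta


def subsumes(c, d, theta=None):
    """True iff there is a substitution theta with c·theta a subset of d."""
    theta = {} if theta is None else theta
    if not c:
        return True
    for target in d:
        extended = match_literal(c[0], target, theta)
        if extended is not None and subsumes(c[1:], d, extended):
            return True
    return False


def theory_subsumes(t, s):
    """T subsumes S iff every clause of S is subsumed by some clause of T."""
    return all(any(subsumes(c, d) for c in t) for d in s)


def lit(pred, *args, sign=True):
    return (sign, pred, args)


T1 = [(lit("P", "x1", "x2"), lit("P", "x2", "x1")), (lit("Q", "x", "x"),)]
T2 = [(lit("P", "x1", "a"), lit("P", "a", "x1")), (lit("Q", "a", "a"),)]
T = T1 + T2  # the union of Lemma 2

print(theory_subsumes(T, T1), theory_subsumes(T, T2))  # T generalizes both
print(theory_subsumes(T1, T), theory_subsumes(T, T1))  # T1 and T are equivalent
```

The backtracking search in `subsumes` is exponential in the worst case, which is consistent with the NP-completeness of subsumption testing mentioned later in the paper.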

We can also find a greatest specialization of an arbitrary set of theories as the
next lemma shows. It builds directly on a greatest specialization of clauses given
by [8, p. 88] and [7]: If $c_1, \ldots, c_n$ are standardized apart then $c_1 \cup \cdots \cup c_n$ is a
greatest specialization. Using this greatest specialization of clauses the lemma
follows quite easily. Thus the proof will not be given here.

Lemma 3. Let $S$ be $\{T_1, \ldots, T_n\}$ where each $T_i$ is a theory. Then a greatest
specialization (greatest lower bound) of $T_1, \ldots,$ and $T_n$ is

$$T = \{\, c'_1 \cup \cdots \cup c'_n \mid (c_1, \ldots, c_n) \in T_1 \times T_2 \times \cdots \times T_n \text{ and } c'_1, \ldots, c'_n \text{ are variants of } c_1, \ldots, c_n \text{ and standardized apart} \,\}$$

Like the least generalization, the greatest specialization presented here is
one of several, but all of them belong to the same equivalence class induced by
$\theta$-equivalence.
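
The construction in Lemma 3 can be sketched in a few lines of Python. This is an illustration under assumptions: clauses are tuples of `(sign, predicate, arguments)` literals, names starting with x, y or z are treated as variables, and standardizing apart is done by appending a per-theory suffix; none of these choices come from the paper.

```python
from itertools import product


def is_var(term):
    """Assumed convention: names starting with x, y or z are variables."""
    return term[0] in "xyz"


def rename(clause, index):
    """Return a variant of the clause whose variables carry the suffix _index."""
    return tuple(
        (sign, pred, tuple(f"{a}_{index}" if is_var(a) else a for a in args))
        for sign, pred, args in clause
    )


def greatest_specialization(theories):
    """One clause per tuple in T1 x ... x Tn: the union of standardized-apart variants."""
    result = []
    for combo in product(*theories):
        union = []
        for index, clause in enumerate(combo, start=1):
            for literal in rename(clause, index):
                if literal not in union:
                    union.append(literal)
        result.append(tuple(union))
    return result


T1 = [((True, "P", ("x", "y")),)]
T2 = [((True, "Q", ("x",)),), ((True, "R", ("x", "a")),)]
for clause in greatest_specialization([T1, T2]):
    print(clause)
```

The result has one clause per element of the Cartesian product, so its size is the product of the sizes of the input theories.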

Thus the set of all theories, $\mathcal{S}$, ordered by $\succeq$ has a least upper bound and a
greatest lower bound for each subset of $\mathcal{S}$. The theory containing only the empty
clause (i.e., $\{\Box\}$) is the top element in this order since for all $T \in \mathcal{S}$, $\{\Box\} \succeq T$ (all
clauses in $T$ are $\theta$-subsumed by $\Box$). The bottom element is the empty theory $\{\}$
which contains no clauses, since for all $T \in \mathcal{S}$, $T \succeq \{\}$ (there are no clauses in $\{\}$
that have to be $\theta$-subsumed by a clause in $T$). Thus $(\mathcal{S}, \succeq)$ is a lattice.

The set of all Horn theories has the same least generalization as the set of
all theories, but a greatest specialization of the set of all Horn theories must be
based on a greatest specialization of a set of Horn clauses, which is [7]:

$$gs_{\mathcal{H}}(c_1, \ldots, c_n) = \begin{cases}
c_1 \cup \cdots \cup c_n & \text{if } c_1, \ldots, c_n \text{ are headless} \\
c_1\theta \cup \cdots \cup c_n\theta & \text{if } \theta = mgu(c_1^+, \ldots, c_n^+) \\
\bot & \text{if } c_1^+, \ldots, c_n^+ \text{ do not unify}
\end{cases}$$

$\bot$ is the bottom clause in the lattice formed by $\mathcal{H}$. A Horn language does
not really have a bottom clause, but $\bot$ is an artificial bottom clause inserted
into the language in order to complete the lattice. This bottom clause is also
necessary when considering Horn theories. A greatest specialization of a set of
Horn theories is then found by replacing $c'_1 \cup \cdots \cup c'_n$ with
$gs_{\mathcal{H}}(c'_1, \ldots, c'_n)$ in lemma 3. Consequently, the set of all Horn theories
ordered by $\theta$-subsumption is a lattice.

Notice that the least generalization of clauses can be used as an inductive
operator to perform an inductive leap beyond what is already known. This is not
true for the least generalization of theories. The reason is that theories are such a
general representation that given a set of examples (e.g. ground atoms) we can
represent them directly. There is no bias in the representation. Thus we must
rely on refinement operators to perform induction. Next, we show that there is
no ideal refinement operator for theories.



Covers. The problem of infinite ascending chains carries over to theories, but
we can prove more directly that there is no finite complete set of downward
covers for theories ordered by $\theta$-subsumption.

Lemma 4. Let $c = \{P(x_1, x_2), P(x_2, x_1)\}$ be a clause in $\mathcal{C}$ and $\mathcal{S}$ be the set of
all finite subsets of $\mathcal{C}$. Then $\{c\} \in \mathcal{S}$ has no finite complete set of downward
covers ordered by $\succeq$.

Proof. Assume that $F = \{T_1, \ldots, T_n\}$ is a finite complete set of downward covers
of $\{c\}$ according to the $\succeq$-order, where each $T_i$ is finite. Now let $T$ be the union
of the $T_i$'s. If we can show that $T$ is a finite complete set of downward covers
of $c$, we will have a contradiction with proposition 1. Thus $F$ cannot be finite
and complete. Since both the number of $T_i$'s and the $T_i$'s themselves are finite,
$T$ must be finite. Thus it only remains to prove that $T$ is a complete set of covers
for $c$.

For each $T_i$ in $F$ we have that $\{c\} \succeq T_i$ since $F$ is a set of downward covers of
$\{c\}$. Thus for all $T_i$ and for all $e_i \in T_i$, $c \succeq e_i$. Since each $e \in T$ is also in a $T_i$, we
must have $c \succeq e$ for each $e$ in $T$. At the same time we have $T_i \not\succeq \{c\}$ for all $T_i$'s,
or else they would not be covers of $\{c\}$. Thus none of the $e$ in the $T_i$'s $\theta$-subsume $c$.
So, each member $e$ of $T$ is a proper specialization of $c$, i.e., for all $e$ in $T$, $c \succ e$.

For each $d$ such that $c \succ d$, we also have $\{c\} \succ \{d\}$. Since $F$ is a complete
set of downward covers of $\{c\}$, there must be a $T_i \in F$ such that $\{c\} \succ T_i \succeq \{d\}$
for each $d$. $T_i \succeq \{d\}$ means there is an $e \in T_i$ such that $e \succeq d$. $c$ must also
$\theta$-subsume this $e$ since $\{c\} \succ T_i$ means that $c \succeq e_i$ for all $e_i \in T_i$. Thus there is
an $e \in T_i$ such that $c \succ e \succeq d$. Since $T$ is the union of the $T_i$'s, it must include $e$ as
well. Thus for all $d$ such that $c \succ d$, we have an $e \in T$ such that $c \succ e \succeq d$. So $T$
is a finite complete set of covers of $c$. But as mentioned above, this contradicts
proposition 1. So there is no finite complete set of covers for $\{c\}$.

Also, the descending chain for clauses given in section 3 reappears for theories:
$\{c_2\} \succ \cdots \succ \{c_n\} \succ \{c_{n+1}\} \succ \cdots \succ \{c\}$ (where $c = P(x, x)$). The proofs will
not be presented here, but it is possible to prove that there is no finite upward
cover $E$ of $\{c\}$ such that $\{c_k\} \succeq E$ for all $k \geq 2$. Then it follows that $\{c\}$ has
no finite complete set of covers.

So if the language contains a predicate or a function with arity $\geq 2$, there
are theories that do not have a finite and complete set of downward or upward
covers. Then by lemma 1, there is no ideal downward or upward refinement
operator for $(\mathcal{S}, \succeq)$ where $\mathcal{S}$ is the set of all finite subsets of $\mathcal{C}$. This also holds if
$\mathcal{S}$ is the set of all finite subsets of $\mathcal{H}$.

5 A most specific theory 

One problem when extending the most specific clause to theories lies in the
difference between $\theta$-subsumption on clauses and $\theta$-subsumption on theories.
$\theta$-subsumption on clauses is a generalization of the subset relation between sets of
literals, but $\theta$-subsumption on theories is a generalization of the superset relation
on sets of clauses. We can create a generalization $S$ of a theory $T$ by adding a
clause to $T$ that is not $\theta$-equivalent to any other clause in $T$. Thus a theory that
$\theta$-subsumes a bottom theory can contain a clause that does not $\theta$-subsume any
clause in the bottom theory. This means that a bottom theory is not a sufficient
condition (as the bottom clause is) telling what kind of clauses there can be in a
theory $\theta$-subsuming it. The bottom theory is rather a necessary condition telling
which clauses there must be in a theory $\theta$-subsuming it.

The most specific clause corresponds to the least Herbrand model $M(B \cup \{\overline{e}\})$
of $B \cup \{\overline{e}\}$. Given the prior necessity ($(B \cup \{\overline{e}\}) \not\models \Box$) and posterior sufficiency
($(B \cup h \cup \{\overline{e}\}) \models \Box$) requirements, it follows that $M(B \cup \{\overline{e}\})$ cannot be a model of
$h$. Thus there must be a substitution $\theta$ such that
$h^+\theta \subseteq HB(B \cup \{\overline{e}\}) \setminus M(B \cup \{\overline{e}\})$
and $h^-\theta \subseteq M(B \cup \{\overline{e}\})$. If we let $HB(B \cup \{\overline{e}\}) \setminus M(B \cup \{\overline{e}\})$ and $M(B \cup \{\overline{e}\})$
be the head and body of the bottom clause, we see that $h$ must $\theta$-subsume it. If
we assume that $B$ and $e$ are function-free, this clause will also be finite since in
that case $M(B \cup \{\overline{e}\})$ and $HB(B \cup \{\overline{e}\})$ are finite.

We can extend this approach to theories. Then the requirements are that
$K = (B \cup \overline{E}) \not\models \Box$ and $L = (B \cup H \cup \overline{E}) \models \Box$ where $E = \{e_1, \ldots, e_m\}$, and
the heads of all the examples $e_i$ are instances of a single learning predicate $P$
(i.e., the predicate we want to learn). Now, each model $M$ of $K$ corresponds to a
distinct bottom clause that an $h$ in $H$ must $\theta$-subsume. The bottom theory could
include just one such clause or all of them. $H$ must $\theta$-subsume it in any case, but
the more of them it contains, the tighter the bound will be. Selecting all of them
is not practically feasible, so we have selected just some. The choice made here
is related to the next section and will be made more apparent there.

Definition 5. Let $B$ be a background Horn theory, $E$ be a set of example Horn
clauses where the head of each clause is an instance of the learning predicate $P$.
Then the most specific theory $BOT_T(B, E)$ is defined as:

$$BOT_T(B, E) = \{BOT_C(B, e) \mid e \in E \text{ and } (B \cup HB_P^{\overline{e}}(B \cup \overline{E}) \cup \{\overline{e}\}) \not\models \Box\}$$
$$BOT_C(B, e) = BOT^+(B, e) \cup BOT^-(B, e)$$
$$BOT^+(B, e) = \{a \mid a \in HB(B \cup \overline{E}) \setminus M\}, \quad BOT^-(B, e) = \{\neg a \mid a \in M\}$$

where $M = M(B \cup HB_P^{\overline{e}}(B \cup \overline{E}) \cup \{\overline{e}\})$ and
$HB_P^{\overline{e}}(B \cup \overline{E}) = HB_P(B \cup \overline{E}) \setminus \{e^+\sigma\}$,
where $e$ is in $E$ and $\sigma$ is the Skolem substitution for $e$ such that
$\overline{e} = (\neg e^+ \wedge e^-)\sigma$ ($HB_P^{\overline{e}}(B \cup \overline{E})$ is the set of all
ground instances of $P$ except the Skolemized head of $e$).

Now, we will show that $BOT_T(B, E)$ is a bottom theory such that any $H$
satisfying $B \cup H \models E$ must $\theta$-subsume it.

Theorem 1. (Completeness wrt. the bottom theory) Let $H$ be a Horn theory,
and let $B$, $E$, and $P$ be as in definition 5. $H$ satisfies $L = (B \cup H \cup \overline{E}) \models \Box$ and
$K = (B \cup \overline{E}) \not\models \Box$ and $HB(K) = HB(L)$ only if $H \succeq BOT_T(B, E)$.

Proof. Let $F = (B \cup HB_P^{\overline{e}}(B \cup \overline{E}) \cup \{\overline{e}\})$ where $e \in E$, and
$(B \cup HB_P^{\overline{e}}(B \cup \overline{E}) \cup \{\overline{e}\}) \not\models \Box$. Then $M(F)$ is a model of $K$ since:

1. $F \models B$. So $M(F) \models B$.

2. $F \models \overline{e}$. So $M(F) \models \overline{e}$. Since $\overline{E}$ is a disjunction of negated Skolemized clauses
where $\overline{e}$ is in this disjunction, $M(F) \models \overline{E}$.

$M(F)$ is an interpretation of $L$, but $M(F)$ cannot be a model of $L$. Thus there
is an $h \in H$ such that $M(F) \not\models h$. Then there is a substitution $\theta$ such that
$h^+\theta \subseteq HB(K) \setminus M(F)$ (since $HB(K) = HB(L)$) and $h^-\theta \subseteq M(F)$, which is similar to
$h \succeq BOT_C(B, e)$. This holds for any example in $E$. So for all $e \in E$ there is an
$h \in H$ such that $h \succeq BOT_C(B, e)$. This is again the same as $H \succeq BOT_T(B, E)$.

5.1 Relevance 

An ILP system such as FOIL or Progol finds clauses iteratively by adding one
clause at a time to a resulting theory. A common feature of these clauses is
that they explain at least one previously unexplained example given the
background theory and previously discovered hypothesis clauses. It is not necessary
that a clause explains any examples directly. The clause can imply some ground
instance of the learning predicate¹ which in turn implies an example through
another (recursive) clause, provided that this clause has already been found. In
this case the clause explains the example indirectly.

Since the search space of theories is quite large, we need to set some
restrictions. We will use the restriction that all clauses in a theory must explain some
examples directly. We call such clauses (theories) relevant. This leads to
incompleteness, since we will no longer consider all possible hypotheses $H$ that satisfy
$B \cup H \models E$. But this restriction is not that severe. If the example set includes at
least one example for each of the clauses in the theory that we want to learn, we
will be able to find it. Also, it seems natural that the examples should give some
evidence of which clauses there should be in the hypothesis theory. Relevance is
defined as:

¹ Consider examples as just ground atoms.

Definition 6. Let $H$ be a Horn theory, and let $B$, $E$, and $P$ be as in definition
5. A clause $h$ in $H$ is relevant iff there is an example $e$ in $E$ such that
$(B \cup HB_P^{\overline{e}}(B \cup \overline{E}) \cup h \cup \{\overline{e}\}) \models \Box$, but
$(B \cup HB_P^{\overline{e}}(B \cup \overline{E}) \cup \{\overline{e}\}) \not\models \Box$. A theory
$H$ is relevant iff every clause in $H$ is relevant.

Example 2. Let $B = \{A(a, b), A(b, c), A(b, d), B(d, c)\}$, $H = \{h_1, h_2, h_3\}$,
$E = \{P(a, b), P(a, c), P(d, c)\}$, $h_1 = P(x, y) \leftarrow A(x, y)$, $h_2 = P(x, y) \leftarrow B(x, y)$,
and $h_3 = P(x, y) \leftarrow A(x, z) \wedge P(z, y)$. Then $H$ is a relevant theory since
$(B \cup HB_P^{\overline{e}}(B \cup \overline{E}) \cup h \cup \{\overline{e}\}) \models \Box$ holds for $h_1$ and $P(a, b)$, $h_2$ and $P(d, c)$, and $h_3$
and $P(a, b)$ or $P(a, c)$. If $P(d, c)$ were removed from $E$, $h_2$ would no longer be
relevant even though it explains $P(a, c)$ indirectly through $h_3$.
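
The relevance check of Example 2 can be reproduced with a small Python sketch. It is a simplification under explicit assumptions: examples are ground atoms, the background theory is a set of ground facts, the language is function-free, and the satisfiability half of Definition 6 is taken for granted rather than tested. The encoding and the names `explains_directly` and `is_relevant` are hypothetical.

```python
from itertools import product


def is_var(term):
    """Assumed convention: names starting with x, y or z are variables."""
    return term[0] in "xyz"


def ground(atom, theta):
    pred, args = atom
    return (pred, tuple(theta.get(a, a) for a in args))


def explains_directly(clause, example, facts, constants, target="P"):
    """True iff one application of the clause derives the example, where every
    ground instance of the target predicate except the example itself may be used."""
    head, body = clause
    variables = sorted({a for _, args in [head, *body] for a in args if is_var(a)})
    for values in product(constants, repeat=len(variables)):
        theta = dict(zip(variables, values))
        if ground(head, theta) != example:
            continue
        body_atoms = [ground(b, theta) for b in body]
        if all(g in facts or (g[0] == target and g != example) for g in body_atoms):
            return True
    return False


def is_relevant(theory, examples, facts, constants):
    """A theory is relevant iff each clause directly explains some example."""
    return all(
        any(explains_directly(h, e, facts, constants) for e in examples)
        for h in theory
    )


B = {("A", ("a", "b")), ("A", ("b", "c")), ("A", ("b", "d")), ("B", ("d", "c"))}
h1 = (("P", ("x", "y")), [("A", ("x", "y"))])
h2 = (("P", ("x", "y")), [("B", ("x", "y"))])
h3 = (("P", ("x", "y")), [("A", ("x", "z")), ("P", ("z", "y"))])
E = [("P", ("a", "b")), ("P", ("a", "c")), ("P", ("d", "c"))]

print(is_relevant([h1, h2, h3], E, B, "abcd"))      # all three examples
print(is_relevant([h1, h2, h3], E[:2], B, "abcd"))  # P(d,c) removed
```

The enumeration over all groundings is exponential in the number of variables per clause; it is meant to mirror the definition, not to be efficient.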

If we apply this restriction to our learning problem, the most specific theory
becomes more useful since each clause in the hypothesis has to $\theta$-subsume some
clause in the most specific theory.

Theorem 2. (Completeness wrt. relevance) Let $B$, $E$, $H$, and $P$ be as in
definition 6. $H$ satisfies $\forall h \in H\ \exists e \in E$:
$K = (B \cup HB_P^{\overline{e}}(B \cup \overline{E}) \cup \{\overline{e}\}) \not\models \Box$ and
$L = (B \cup HB_P^{\overline{e}}(B \cup \overline{E}) \cup h \cup \{\overline{e}\}) \models \Box$ (i.e., $H$ is relevant) and
$HB(L) = HB(K) = HB(B \cup \overline{E})$ only if $\forall h \in H\ \exists b \in BOT_T(B, E)$: $h \succeq b$.

Proof. For each $h$ in $H$ there is an example $e \in E$ such that $K$ is satisfiable
while $L$ is inconsistent. Then $M(K)$ is a model of $K$, but not of $L$ since $L$ is
inconsistent. This means that $h$ must be false in $M(K)$. Thus there must be a
substitution $\theta$ such that $h^+\theta \subseteq HB(K) \setminus M(K)$ (since $HB(L) = HB(K)$) and
$h^-\theta \subseteq M(K)$. This is exactly the same condition as $h \succeq BOT_C(B, e)$ (since
$HB(K) = HB(B \cup \overline{E})$), and $BOT_C(B, e)$ is in $BOT_T(B, E)$.

Thus for each $h$ in $H$ there is an example $e$ and for this example
$h \succeq BOT_C(B, e)$. So we have $\forall h \in H\ \exists b \in BOT_T(B, E)$: $h \succeq b$.

6 Bounded refinement operators 

Having defined the most specific theory, we are ready to define a refinement
operator that can be applied on a search space bounded by this theory. We only
consider downward operators, and find a bounded operator for theories. This
operator applies a bounded operator for clauses as a sub-operator. Since there
already exists a bounded downward operator for clauses such as the one in Progol
[5], we could have used it as a sub-operator. But it relies on mode declarations of
each predicate in the language. These mode declarations are a syntactic bias
that restricts the language, and we wanted rather to consider an unrestricted
language. So we have chosen to base our operator on Laird's operator.

6.1 A bounded downward operator for clauses 

Laird's refinement operator (denoted $\rho_L$ as in [7]) is a downward refinement
operator for $(\mathcal{C}, \succeq)$. It is locally finite and complete for a clausal language if the
language contains only a finite number of constants, functions and predicates
[4]. We define a bounded version of this refinement operator such that it works
with a bottom clause.

Definition 7. Let $c$ be a clause. Then $\rho_L(c)$ is a downward refinement operator
and is defined as:

1. For every variable $z$ in $c$ and every $n$-ary function symbol $f$ in the language,
$c\{z/f(x_1, \ldots, x_n)\} \in \rho_L(c)$ where $x_1, \ldots, x_n$ are distinct variables not
appearing in $c$.

2. For every two distinct variables $x$ and $z$ in $c$, $c\{z/x\} \in \rho_L(c)$.

3. For every $n$-ary predicate symbol $p$ in the language, both $c \cup \{p(x_1, \ldots, x_n)\}$
and $c \cup \{\neg p(x_1, \ldots, x_n)\}$ are in $\rho_L(c)$ where $x_1, \ldots, x_n$ are distinct variables
not appearing in $c$.

Definition 8. Let $c$ be a clause and $\bot$ a bottom clause such that $c \succeq \bot$. Then
$\rho_B(c, \bot)$ is a downward refinement operator bounded by $\bot$ and is defined as:

$$\rho_B(c, \bot) = \{c' \in \rho_L(c) \mid c' \succeq \bot\}$$

This refinement operator might be inefficient since it requires $\theta$-subsumption
testing which is NP-complete [2]. We can prove that this bounded refinement
operator is locally finite and complete wrt. a bottom clause.
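
Definitions 7 and 8 can be sketched as follows for a function-free language in which constants are the only function symbols. This Python code is an illustrative sketch, not the implementation of any existing system: the literal encoding, the variable-naming convention (names starting with x, y or z), and the names `laird`, `bounded`, `BOTTOM`, `CONSTANTS` and `PREDICATES` are all assumptions.

```python
from itertools import count, islice


def is_var(term):
    return term[0] in "xyz"


def subsumes(c, d, theta=None):
    """True iff some substitution theta makes c·theta a subset of d (function-free)."""
    c, theta = tuple(c), ({} if theta is None else theta)
    if not c:
        return True
    (sign, pred, args), rest = c[0], c[1:]
    for t_sign, t_pred, t_args in d:
        if (sign, pred, len(args)) != (t_sign, t_pred, len(t_args)):
            continue
        new = dict(theta)
        if all(new.setdefault(a, t) == t if is_var(a) else a == t
               for a, t in zip(args, t_args)):
            if subsumes(rest, d, new):
                return True
    return False


def substitute(clause, var, term):
    return frozenset(
        (sign, pred, tuple(term if a == var else a for a in args))
        for sign, pred, args in clause
    )


def fresh_variables(clause, n):
    used = {a for _, _, args in clause for a in args}
    names = (f"x{i}" for i in count(1))
    return list(islice((v for v in names if v not in used), n))


def laird(clause, constants, predicates):
    """Refinements of Definition 7, with constants as the only function symbols."""
    variables = sorted({a for _, _, args in clause for a in args if is_var(a)})
    refinements = set()
    for z in variables:
        for k in constants:                  # item 1, arity-0 functions only
            refinements.add(substitute(clause, z, k))
        for x in variables:                  # item 2, unify two distinct variables
            if x != z:
                refinements.add(substitute(clause, z, x))
    for pred, arity in predicates.items():   # item 3, add a most general literal
        args = tuple(fresh_variables(clause, arity))
        refinements.add(clause | {(True, pred, args)})
        refinements.add(clause | {(False, pred, args)})
    return refinements


def bounded(clause, bottom, constants, predicates):
    """Definition 8: keep the refinements that still subsume the bottom clause."""
    return {r for r in laird(clause, constants, predicates) if subsumes(r, bottom)}


BOTTOM = frozenset({(True, "P", ("a", "b")), (False, "Q", ("a",))})
CONSTANTS = ["a", "b"]
PREDICATES = {"P": 2, "Q": 1}
for refinement in bounded(frozenset(), BOTTOM, CONSTANTS, PREDICATES):
    print(sorted(refinement))
```

Starting from the empty clause, only the refinements that can still reach the bottom clause survive the filter; every call to `bounded` performs one subsumption test per candidate, which is where the cost noted above arises.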

Theorem 3. Let $\mathcal{C}$ be a clausal language, containing only a finite number of
constants, function symbols and predicate symbols. Let $\bot$ be a bottom clause.
Then $\rho_B$ is a downward refinement operator which is locally finite and complete
wrt. $\bot$.

Proof. First, $\rho_B$ is locally finite since $\rho_L$ is locally finite, and $\rho_B(c, \bot) \subseteq \rho_L(c)$.
Second, we will show that if $c \succ d \succeq \bot$ holds for any two clauses $c$ and $d$ then
any $\rho_L$-chain between $c$ and $d$ is also a $\rho_B$-chain. We already know that $\rho_L$ is
complete. So for any two clauses $c$ and $d$ satisfying $c \succ d \succeq \bot$, there is a $\rho_L$-chain
between them. Then there must be a $\rho_B$-chain as well. So $\rho_B$ is complete wrt.
$\bot$.

Now, let $c$ and $d$ be two clauses such that $c \succ d \succeq \bot$, and let $c = c_1, c_2, \ldots, c_n
= d'$ where $d' \sim d$ be a $\rho_L$-chain between them. For $\rho_L$ (and $\rho_B$) we have that
$d \in \rho_L(c)$ implies $c \succeq d$. This follows directly from the definition. Thus we have
$(c =)\, c_1 \succeq c_2 \succeq \cdots \succeq c_n\, (= d') \succeq d$. Since $d \succeq \bot$, it follows that all $c_i$ must
$\theta$-subsume $\bot$. $\rho_B(c_i, \bot)$ contains the subset of $\rho_L(c_i)$ where each clause $\theta$-subsumes
$\bot$. So $c_{i+1}$ must be in $\rho_B(c_i, \bot)$, since it is in $\rho_L(c_i)$ and $\theta$-subsumes $\bot$. Therefore
this chain is also a $\rho_B$-chain.

Remark 1. Functions of any arity are allowed in the refinement operator presented
here. But since a bottom clause is infinite for a clausal language with functions,
only bottom clauses for a function-free language can be used in practice. So in
a practical application, flattening must be used to remove functions, and the
refinement operator will only add functions of arity 0 (i.e., constants).

6.2 A bounded downward operator for theories 

Nienhuys-Cheng and Wolf [7, p. 317] defined a complete and locally finite
refinement operator for $(\mathcal{S}, \models)$ where $\mathcal{S}$ is the set of all finite subsets of $\mathcal{C}$. This operator
used Laird's operator as a sub-operator. We define a similar, but bounded
refinement operator for $\succeq$ where $\rho_B$ is applied as a sub-operator. Since we are only
constructing an operator for $\theta$-subsumption and not entailment, this operator
has an advantage. It does not need to save every clause produced in order to
resolve them with later refinements.

Definition 9. Let $\mathcal{S}$ be the set of all finite subsets of $\mathcal{C}$, $T = \{c_1, \ldots, c_n\}$ a
theory in $\mathcal{S}$, and $\bot = \{\bot_1, \ldots, \bot_m\} \in \mathcal{S}$ a bottom theory such that $T \succeq \bot$.
Then $\rho_{BT}(T, \bot)$ is a bounded downward refinement operator and is defined as:

1. $((T \setminus \{c_i\}) \cup \rho_B(c_i, \bot_j)) \in \rho_{BT}(T, \bot)$ for all $\bot_j \in f(c_i)$ and $1 \leq i \leq n$.

2. If $f(c_i) \subseteq \bigcup_{j \neq i} f(c_j)$ then $(T \setminus \{c_i\}) \in \rho_{BT}(T, \bot)$, for $1 \leq i \leq n$.

where $f(c_i) = \{\bot_j \in \bot \mid c_i \succeq \bot_j\}$, for $1 \leq i \leq n$.

The refinement operator is complete wrt. a bottom theory, $BOT_T(B, E)$, if
the set of theories is restricted so only relevant theories are allowed.

Theorem 4. Let $\mathcal{C}$ be a language, containing only a finite number of constants,
functions, and predicates. Let $\bot$ be a finite bottom theory consisting of clauses
from $\mathcal{C}$. Let $\mathcal{R}$ be the set of all finite subsets of $\mathcal{C}$ such that each theory $T \in \mathcal{R}$
satisfies $\forall c \in T\ \exists \bot_j \in \bot$: $c \succeq \bot_j$. Then $\rho_{BT}$ is a locally finite and complete
downward refinement operator wrt. $\bot$ for $(\mathcal{R}, \succeq)$.

Proof. $\rho_{BT}$ is locally finite. This follows from the facts that $\rho_B$ is locally finite
and that there is a finite number of clauses in each theory. Thus items 1 and 2 are
applied only a finite number of times on a theory and each application creates
another finite theory.

To prove the completeness of $\rho_{BT}$, assume that there are two theories
$S = \{c_1, \ldots, c_n\}$ and $T$ such that $S \succ T \succeq \bot$. Let $T' = \{d_1, \ldots, d_m\}$ be a minimal
subset of $T$ such that $T \sim T'$ and there is no proper subset $U$ of $T'$ such
that $U \sim T'$. Then no $d_i \in T'$ $\theta$-subsumes another $d_j \in T'$ (or a proper subset
$U$ would exist). Since $S \succ T \sim T'$, each $d_j$ must be $\theta$-subsumed by some $c_i$,
i.e., $c_i \succeq d_j$. For each $c_i$ let $T'_i = \{d^i_1, \ldots, d^i_{m_i}\}$ be the subset of $T'$ with all
the clauses in $T'$ that are $\theta$-subsumed by $c_i$ (i.e., $T'_i = \{d \in T' \mid c_i \succeq d\}$). Now we
know that each of the clauses in $T'_i$ $\theta$-subsumes a clause in $\bot$ since
$\forall d \in T\ \exists \bot_j \in \bot$: $d \succeq \bot_j$. Thus if $d^i_l \in T'_i$, it must $\theta$-subsume some clause
$\bot_j \in \bot$, and by the completeness of $\rho_B$, there must be a $\rho_B$-chain
$c_i = e_1, e_2, \ldots, e_{k^i_l} \sim d^i_l$ with length $k^i_l$ from $c_i$.

Let $k^i = \max\{k^i_1, \ldots, k^i_{m_i}\}$. Using item 1 of the definition in a breadth-first
manner, there must be a $\rho_{BT}$-chain $\{c_i\} = R^i_0, \ldots, R^i_j, \ldots, R^i_{k^i}$ where for
each $d \in T'_i$ there is a $\theta$-equivalent $d' \in R^i_{k^i}$. We stop unfolding a branch when
it reaches a clause that is $\theta$-equivalent to one in $T'_i$. So each $R^i_j$ contains the
refinements of all the clauses in $R^i_{j-1}$ except those that are $\theta$-equivalent to one
in $T'_i$, i.e., $R^i_j = X^i_j \cup Y^i_j$ where $X^i_j = \{e \in R^i_{j-1} \mid e \sim d \in T'_i\}$ and
$Y^i_j = \{f \mid f \in \rho_B(e, \bot_l) \text{ and } \bot_l \in f(e) \text{ and } e \in R^i_{j-1} \setminus X^i_j\}$. Between each
$R^i_{j-1}$ and $R^i_j$ there is a $\rho_{BT}$-subchain $R^i_{j-1} = Q_0, \ldots, Q_l = R^i_j$ where
$Q_k = (Q_{k-1} \setminus \{e_k\}) \cup \{f \mid f \in \rho_B(e_k, \bot_j) \text{ and } \bot_j \in f(e_k)\}$ if
$R^i_{j-1} \setminus X^i_j = \{e_1, \ldots, e_l\}$.

Thus for each $c_i$ we have a $\rho_{BT}$-chain between $\{c_i\}$ and $R^i_{k^i}$. Handling each
$c_i$ sequentially, we get a $\rho_{BT}$-chain $S = S_0, \ldots, S_1, \ldots, S_2, \ldots, S_n$ where
$S_i = (S_{i-1} \setminus \{c_i\}) \cup R^i_{k^i}$. Then we must have $T'' \subseteq S_n$ where $T'' = \{d'_1, \ldots, d'_m\}$
and $d'_l \sim d_l$ for all $1 \leq l \leq m$.

Now $T \succeq \bot$ and $T \sim T' \sim T''$. So $T'' \succeq \bot$ and $f(T'') = \bigcup_{c \in T''} f(c) = \bot$.
Thus for each clause $d \in (S_n \setminus T'')$ we have $f(d) \subseteq f(T'')$. So these clauses can
be deleted by item 2, and there is a $\rho_{BT}$-chain from $S_n$ to $T''$. This means that
there is a complete $\rho_{BT}$-chain from $S$ to $T''$ where $T \sim T''$. So $\rho_{BT}$ is complete.

7 Conclusion and future work 

In this paper, we studied the search space ordered by $\theta$-subsumption on
theories. We found a least generalization and a greatest specialization of theories
and showed that the search space is a lattice. We also proved that there is no
ideal refinement operator for this search space, and introduced a most specific
theory for a restricted set of theories, called relevant theories. Finally, a
downward refinement operator for theories ordered by $\theta$-subsumption was given. It
was bounded by a most specific theory, and we proved it complete with some
restrictions.

This refinement operator is not very efficient since it requires $\theta$-subsumption
testing. So a more efficient operator should be developed. Future work should
also include an actual implementation.

Abstract. We present a method for discovering new knowledge from
structural data which are represented by graphs in the framework of
inductive logic programming. A graph, or network, is widely used for
representing relations between various data and expressing a small and
easily understandable hypothesis. Formal Graph System (FGS) is a kind
of logic programming system which directly deals with graphs just like
first order terms. By employing refutably inductive inference algorithms
and graph algorithmic techniques, we are developing a knowledge
discovery system KD-FGS, which acquires knowledge directly from graph
data by using FGS as a knowledge representation language.

In this paper we develop a logical foundation of our knowledge discovery
system. A term tree is a pattern which consists of variables and tree-like
structures. We give a polynomial-time algorithm for finding a unifier
of a term tree and a tree in order to make consistency checks efficiently.
Moreover we give experimental results on some graph theoretical notions
with the system. The experiments show that the system is useful for
finding new knowledge.



1 Introduction 

The aim of knowledge discovery is to find a small and easily understandable
hypothesis explaining given data. Many machine learning and data mining
technologies for discovering knowledge have been proposed in many fields. In particular,
Inductive Logic Programming (ILP) techniques have been applied to discover
knowledge from "real-world" data [4]. A graph is one of the most common
abstract structures and is widely used for representing relations between various
data. In many "real-world" domains such as vision, pattern recognition and
organic chemistry, data are naturally represented by graphs.

Formal Graph System (FGS, [12]) is a kind of logic programming system
which uses graphs, called term graphs, instead of terms in first-order logic. FGS
can naturally represent logical knowledge explaining data represented by graphs.
When we try to discover new knowledge from given data, we cannot assume
that a given hypothesis space contains a hypothesis explaining given data from
the beginning. Hence, when we know that the hypothesis is not in the hypothesis
space, it is necessary to change the hypothesis space to another space. In [9], the
method of refutably inductive inference is proposed. If a correct hypothesis does
not exist in a hypothesis space, we can refute the hypothesis space and change
it to another one by using this method. Refuting a hypothesis space is itself an
important suggestion for us.

With the above motivations, in [8], we implemented a prototype of a knowl- 
edge discovery system KD-FGS (see Fig. 1). As inputs, the system receives posi- 
tive and negative examples of graph data. As an output, the system produces an 
FGS program which is consistent with the positive and negative examples if such 
a hypothesis exists. Otherwise, the system refutes the hypothesis space. KD-FGS 
consists of an FGS interpreter and a refutably inductive inference algorithm of 
FGS programs. The FGS interpreter is used to check whether a hypothesis is 
consistent with the given graph data or not. The refutably inductive inference 
algorithm is a special type of inductive inference algorithm with refutability of 
hypothesis spaces and is based on [9]. When the hypothesis space is refuted, 
KD-FGS chooses another hypothesis space and tries to make a discovery in the 
new hypothesis space. By refuting the hypothesis space, the algorithm gives im- 
portant suggestions to achieve the goal of knowledge discovery. Thus, KD-FGS 
is useful for knowledge discovery from graph data. 

In this paper, we also consider a restricted term graph $g$, called a term tree,
such that the term graph obtained by applying any substitution $\theta$ to $g$ is a
tree, where each graph in $\theta$ is a tree. A term tree can represent a tree structure
which has variables at internal nodes. But we cannot represent such a tree
structure in the standard representation of a first order term. In [1,3], a tree
pattern was considered, and learning algorithms for tree patterns from queries
were presented, where a tree pattern has constants at its internal nodes, but
only its leaves may be variables. Since KD-FGS is a system directly dealing with
graphs, the running time is long, in general. In particular, KD-FGS must solve the
subgraph isomorphism problem, which is NP-complete, in the component of the
FGS interpreter. However, a polynomial-time algorithm solving the subgraph
isomorphism problem for trees was proposed in [10]. The FGS interpreter must
find a unifier of an input graph and a term graph. Since there exists no mgu
(most general unifier) of two term trees in general, we cannot apply the standard
term algorithms to finding a unifier of a term tree and a tree. We therefore give a
polynomial-time algorithm for finding a unifier of a term tree and a tree by using
graph theoretical techniques. By employing this algorithm, if input data have
tree structures, KD-FGS may output a hypothesis within a practical time. There
are many "real-world" data having tree structures [13]. This algorithm enables
the application of KD-FGS to those data.

This paper is organized as follows. In Section 2, we introduce FGS as a
new knowledge representation language for graph data. In Section 3, by giving a
framework of refutably inductive inference of FGS programs, we develop a logical
foundation of KD-FGS. In Section 4, we give a polynomial-time algorithm for
finding a unifier of a term tree and a tree. In Section 5, we apply our system to some
examples of graph theoretical notions in order to show the usefulness of
our system.

Fig. 1. KD-FGS: a knowledge discovery system from graph data using FGS.


2 FGS as a New Knowledge Representation Language 

Formal Graph System (FGS, [12]) is a kind of logic programming system which
directly deals with graphs just like first order terms. In [11, 12], we have shown
that a class of graphs is generated by a hyperedge replacement grammar (HRG)
[5] if and only if it is defined by an FGS of a special form called a regular
FGS, and that for a node-label controlled graph grammar (NLG grammar) $G$
introduced in [6], there exists an FGS $\Gamma$ such that the language generated by
$G$ is definable by $\Gamma$. These show that FGS is more powerful than HRG or
NLG grammar.

Let $\Sigma$ and $\Lambda$ be finite alphabets, and let $X$ be an alphabet, whose elements are
called variable labels. Assume that $(\Sigma \cup \Lambda) \cap X = \emptyset$. A term graph $g = (V, E, H)$
consists of a vertex set $V$, an edge set $E$ and a multi-set $H$ where each element
is a list of distinct vertices in $V$ and is called a variable. A term graph $g$
has a vertex labeling $\varphi_g : V \to \Sigma$, an edge labeling $\psi_g : E \to \Lambda$ and a variable
labeling $\lambda_g : H \to X$. A term graph $g = (V, E, H)$ is called ground and simply
denoted by $g = (V, E)$ if $H = \emptyset$. For example, a term graph $g = (V, E, H)$ is
shown in Fig. 2, where $V = \{u_1, u_2\}$, $E = \emptyset$, $H = \{e_1 = (u_1, u_2), e_2 = (u_1, u_2)\}$,
$\varphi_g(u_1) = s$, $\varphi_g(u_2) = t$, $\lambda_g(e_1) = x$, and $\lambda_g(e_2) = y$. A variable is represented
by a box with lines to its elements and the order of its elements is indicated by
the numbers at these lines. An atom is an expression of the form $p(g_1, \ldots, g_n)$,
where $p$ is a predicate symbol with arity $n$ and $g_1, \ldots, g_n$ are term graphs. Let
$A, B_1, \ldots, B_m$ be atoms with $m \geq 0$. Then, a graph rewriting rule is a clause of
the form $A \leftarrow B_1, \ldots, B_m$. An FGS program is a finite set of graph rewriting
rules. For example, the FGS program $\Gamma_{SP}$ in Fig. 1 generates the family of all
two-terminal series parallel (TTSP) graphs.

Fig. 2. Term graphs $g$ and $g\theta$ obtained by applying a substitution
$\theta = \{x := [g_1, (v_1, v_2)], y := [g_2, (w_1, w_2)]\}$ to $g$.

Let $g$ be a term graph and $\sigma$ be a list of distinct vertices in $g$. We call the
form $x := [g, \sigma]$ a binding for a variable label $x \in X$. A substitution $\theta$ is a finite
collection of bindings $\{x_1 := [g_1, \sigma_1], \ldots, x_n := [g_n, \sigma_n]\}$, where the $x_i$'s are mutually
distinct variable labels in $X$ and each $g_i$ ($1 \le i \le n$) has no variable labeled with
an element in $\{x_1, \ldots, x_n\}$. For a set or a list $S$, the number of elements in $S$ is
denoted by $|S|$. In the same way as in a logic programming system, we obtain a new
term graph $f$ by applying a substitution $\theta = \{x_1 := [g_1, \sigma_1], \ldots, x_n := [g_n, \sigma_n]\}$
to a term graph $g = (V, E, H)$ in the following way. For each binding
$x_i := [g_i, \sigma_i] \in \theta$ ($1 \le i \le n$) in parallel, we attach $g_i$ to $g$ by removing all the variables
$t_1, \ldots, t_k$ labeled with $x_i$ from $H$, and by identifying the $m$-th element $t_j^m$ of $t_j$
and the $m$-th element $\sigma_i^m$ of $\sigma_i$ for each $1 \le j \le k$ and each $1 \le m \le |t_j| = |\sigma_i|$,
respectively. We remark that the label of each vertex $t_j^m$ of $g$ is used for the
resulting term graph, which is denoted by $g\theta$. Namely, the label of $\sigma_i^m$ is ignored
in $g\theta$. In Fig. 2, for example, we draw the term graph $g\theta$ which is obtained by
applying a substitution $\theta = \{x := [g_1, (v_1, v_2)], y := [g_2, (w_1, w_2)]\}$ to the term
graph $g$. A unifier of two term graphs $g_1$ and $g_2$ is a substitution $\theta$ such that $g_1\theta$
and $g_2\theta$ are isomorphic. In general, there exists no mgu (most general unifier) of
two term graphs. Therefore, in FGS a derivation is based on an enumeration of
unifiers, and only ground goals are considered in this paper. A graph rewriting rule
$C$ is provable from an FGS program $\Gamma$ if $C$ is obtained from $\Gamma$ by finitely many
applications of graph rewriting rules and modus ponens. An FGS interpreter as
a component of KD-FGS is used to check whether a hypothesis, which is an FGS
program, is consistent with the given graph data or not.
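
The attachment step described above can be sketched in Python. This is only an illustration under stated assumptions: the `TermGraph` class, the tuple encodings, and the two replacement graphs `g1` and `g2` are hypothetical (the actual graphs drawn in Fig. 2 are not reproduced here), and identification of vertices is implemented by renaming the vertices of each attached copy.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class TermGraph:
    vlabel: dict                                   # vertex -> vertex label
    edges: list = field(default_factory=list)      # (u, v, edge label)
    variables: list = field(default_factory=list)  # (variable label, (v1, ..., vk))

    def size(self):
        # |g| = |V| + |E| + |H|
        return len(self.vlabel) + len(self.edges) + len(self.variables)

def apply_substitution(g, theta):
    """Return g-theta, where theta maps a variable label to (graph, sigma)."""
    out = TermGraph(dict(g.vlabel), list(g.edges), [])
    fresh = count()
    for xlabel, ports in g.variables:
        if xlabel not in theta:
            out.variables.append((xlabel, ports))  # variable is left untouched
            continue
        h, sigma = theta[xlabel]
        assert len(sigma) == len(ports)
        k = next(fresh)
        # Each removed variable gets its own copy of h; sigma^m is identified with t^m.
        rename = {v: ("copy", k, v) for v in h.vlabel}
        rename.update(zip(sigma, ports))
        for v, lab in h.vlabel.items():
            if v not in sigma:                     # labels of sigma^m are ignored
                out.vlabel[rename[v]] = lab
        out.edges += [(rename[a], rename[b], lab) for a, b, lab in h.edges]
        out.variables += [(lab, tuple(rename[v] for v in vs)) for lab, vs in h.variables]
    return out

# The term graph g of Fig. 2: two vertices joined by the variables x and y.
g = TermGraph({"u1": "s", "u2": "t"}, [], [("x", ("u1", "u2")), ("y", ("u1", "u2"))])
# Hypothetical replacement graphs: a path of length two and a single edge.
g1 = TermGraph({"v1": "a", "m": "b", "v2": "c"}, [("v1", "m", "e"), ("m", "v2", "e")])
g2 = TermGraph({"w1": "a", "w2": "c"}, [("w1", "w2", "e")])
g_theta = apply_substitution(g, {"x": (g1, ("v1", "v2")), "y": (g2, ("w1", "w2"))})
```

In this toy run the vertices `u1` and `u2` keep their own labels, and the resulting graph is ground because both variables have been replaced.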
3 Refutably Inductive Inference of FGS Programs

In this section we introduce refutably inductive inference of FGS programs. We
give two interesting hypothesis spaces of FGS programs, weakly reducing and
size-bounded FGS programs, which are refutably inferable. Moreover, we present
refutably inductive inference algorithms for these hypothesis spaces. We give our
framework of refutably inductive inference of FGS programs according to [2, 9,
14]. Mukouchi and Arikawa [9] originated a computational learning theory of
machine discovery from facts. They showed that refutably inductive inference
is essential in machine discovery from facts and that sufficiently large hypothesis
spaces for language learning are refutably inferable.

We give our hypothesis spaces of FGS programs. Let $g = (V, E, H)$ be a term
graph. We denote the size of $g$ by $|g|$ and define $|g| = |V| + |E| + |H|$. For
example, $|g| = |V| + |E| + |H| = 2 + 0 + 2 = 4$ for the term graph $g = (V, E, H)$ in
Fig. 2. For an atom $p(g_1, \ldots, g_n)$, we define $\|p(g_1, \ldots, g_n)\| = |g_1| + \cdots + |g_n|$. An
erasing binding is a binding $x := [g, \sigma]$ such that $g$ consists of all vertices in $\sigma$,
no edge and no variable. An erasing substitution is a substitution which contains
an erasing binding. In this paper, we disallow erasing substitutions. Then
$|g\theta| \ge |g|$ for any term graph $g$ and any substitution $\theta$ (Size Non-decreasing
Property). A graph rewriting rule $A \leftarrow B_1, \ldots, B_m$ is said to be weakly reducing
(resp., size-bounded) if $\|A\theta\| \ge \|B_i\theta\|$ for any $i = 1, \ldots, m$ and any substitution
$\theta$ (resp., $\|A\theta\| \ge \|B_1\theta\| + \cdots + \|B_m\theta\|$ for any substitution $\theta$). An FGS program
$\Gamma$ is weakly reducing (resp., size-bounded) if every graph rewriting rule in $\Gamma$
is weakly reducing (resp., size-bounded). A size-bounded FGS program is also
weakly reducing. For example, the FGS program $\Gamma_{SP}$ in Fig. 1 is weakly reducing
but not size-bounded. Let $g = (V, E, H)$ be a term graph. For a variable label
$x \in X$, the number of variables in $H$ labeled with $x$ is denoted by $o(x, g)$.
For example, $o(x, g) = 1$ and $o(y, g) = 1$ for the term graph $g = (V, E, H)$
in Fig. 2. For an atom $p(g_1, \ldots, g_n)$ and a variable label $x \in X$, we define
$o(x, p(g_1, \ldots, g_n)) = o(x, g_1) + \cdots + o(x, g_n)$.

We consider two properties of hypothesis spaces for machine discovery
from facts. Firstly, the hypothesis space for machine discovery must be recursively
enumerable. Secondly, whether a hypothesis is consistent with examples
or not must be recursively decidable. The following Lemmas 1 and 2 show that
our target hypothesis spaces have the first and second properties, respectively.
The proofs of Lemmas 1 and 2 are based on [2, 14]. If a hypothesis space
does not have the Size Non-decreasing Property, Lemma 1 does not hold. The set
of all ground atoms with ground term graphs as arguments is called the Herbrand
base and denoted by $HB$. For an FGS program $\Gamma$, $M_\Gamma$ denotes the least
Herbrand model of $\Gamma$.

Lemma 1. A graph rewriting rule $A \leftarrow B_1, \ldots, B_m$ is weakly reducing (resp.,
size-bounded) if and only if $\|A\| \ge \|B_i\|$ and $o(x, A) \ge o(x, B_i)$ for any $i =
1, \ldots, m$ and any variable label $x$ (resp., $\|A\| \ge \|B_1\| + \cdots + \|B_m\|$ and $o(x, A) \ge
o(x, B_1) + \cdots + o(x, B_m)$ for any variable label $x$).
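
The syntactic test of Lemma 1 is easy to mechanise. The following Python sketch assumes a deliberately simplified encoding in which each term graph is summarised by its vertex count, edge count and list of variable labels; the function names and the example rule are hypothetical and are not taken from Fig. 1.

```python
from collections import Counter

def atom_size(atom):
    # An atom is a list of term graphs, each summarised as
    # (number of vertices, number of edges, list of variable labels).
    return sum(v + e + len(xs) for v, e, xs in atom)

def occurrences(atom):
    # o(x, atom) for every variable label x occurring in the atom.
    occ = Counter()
    for _, _, xs in atom:
        occ.update(xs)
    return occ

def is_weakly_reducing(head, body):
    h = occurrences(head)
    return all(
        atom_size(head) >= atom_size(b)
        and all(h[x] >= n for x, n in occurrences(b).items())
        for b in body
    )

def is_size_bounded(head, body):
    total = Counter()
    for b in body:
        total.update(occurrences(b))
    h = occurrences(head)
    return (atom_size(head) >= sum(atom_size(b) for b in body)
            and all(h[x] >= n for x, n in total.items()))

# Hypothetical series-composition rule: the head graph has three vertices chained
# by the variables x and y; each body atom has two vertices joined by one variable.
head = [(3, 0, ["x", "y"])]
body = [[(2, 0, ["x"])], [(2, 0, ["y"])]]
```

For this rule the head has size 5 while the two body atoms have size 3 each, so the rule passes the weakly reducing test but fails the size-bounded one.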
Lemma 2. Let $\Gamma$ be a weakly reducing or size-bounded FGS program. Then the
least Herbrand model $M_\Gamma$ of $\Gamma$ is a recursively decidable set.

We explain the refutably inductive inference of FGS programs. Let $\Pi$ be a
finite set of predicate symbols. For an atom $A$, $pred(A)$ denotes the predicate
symbol of $A$. For a set $\Pi_0 \subseteq \Pi$ and a set $S$ of atoms, $S|_{\Pi_0}$ denotes the set of all atoms
in $S$ whose predicate symbols are in $\Pi_0$. That is, $S|_{\Pi_0} = \{A \in S \mid pred(A) \in \Pi_0\}$.
A predicate-restricted complete presentation of a set $I \subseteq HB$ w.r.t. $\Pi_0 \subseteq \Pi$ is
an infinite sequence $(A_1, t_1), (A_2, t_2), \ldots$ of elements in $HB|_{\Pi_0} \times \{+, -\}$ such
that $\{A_i \mid t_i = +, i \ge 1\} = I|_{\Pi_0}$ and $\{A_i \mid t_i = -, i \ge 1\} = HB|_{\Pi_0} \setminus I|_{\Pi_0}$. A
refutably inductive inference algorithm (RIIA) is a special type of algorithm that
receives a predicate-restricted complete presentation as an input. An RIIA $\mathcal{A}$ is
said to refute a hypothesis space if $\mathcal{A}$ produces the sign “refute” as an output
and stops. An RIIA either produces infinitely many FGS programs as outputs
or refutes a hypothesis space. For an RIIA $\mathcal{A}$ and a presentation $\delta$, $\mathcal{A}(\delta[n])$
denotes the last output produced by $\mathcal{A}$ when it is successively presented the first $n$
elements of $\delta$. An RIIA $\mathcal{A}$ is said to converge to an FGS program $\Gamma$ for a
presentation $\delta$ if there is a positive integer $m_0$ such that for any $m \ge m_0$, $\mathcal{A}(\delta[m])$ is
defined and equal to $\Gamma$. Let $BS$ be a hypothesis space of FGS programs. For an
FGS program $\Gamma \in BS$ and a predicate-restricted complete presentation $\delta$ of $M_\Gamma$
w.r.t. $\Pi_0 \subseteq \Pi$, an RIIA $\mathcal{A}$ is said to infer the FGS program $\Gamma$ w.r.t. $BS$ in the
limit from $\delta$ if $\mathcal{A}$ converges to an FGS program $\Gamma' \in BS$ with $M_{\Gamma'}|_{\Pi_0} = M_\Gamma|_{\Pi_0}$
for $\delta$.

A hypothesis space $BS$ is said to be theoretical-term-freely and refutably
inferable from complete data if, for any nonempty finite subset $\Pi_0$ of $\Pi$, there
is an RIIA $\mathcal{A}$ which satisfies the following condition: for any set $I \subseteq HB$ and
any predicate-restricted complete presentation $\delta$ of $I$ w.r.t. $\Pi_0$, (i) if there is an
FGS program $\Gamma \in BS$ such that $M_\Gamma|_{\Pi_0} = I|_{\Pi_0}$, then $\mathcal{A}$ infers $\Gamma$ w.r.t. $BS$ in
the limit from $\delta$, (ii) otherwise $\mathcal{A}$ refutes the hypothesis space $BS$ from $\delta$.

Theoretical terms are supplementary predicates that are necessary for defining
some goal predicates. In the above definition, the phrase “theoretical-term-freely
inferable” means that, using only facts on the goal predicates, an RIIA
can generate some supplementary predicates. $\mathcal{WR}^{[\le n]}$ (resp., $\mathcal{SB}^{[\le n]}$) denotes
the set of all weakly reducing (resp., size-bounded) FGS programs with at most
$n$ graph rewriting rules. There are many FGS programs which have the same
least Herbrand model. We can assume a canonical form of such FGS programs
by fixing predicate symbols in $\Pi \setminus \Pi_0$ and variable labels. $\mathcal{CWR}^{[m]}[\Pi_0]$
denotes the set of all such canonical weakly reducing FGS programs with just
$m$ graph rewriting rules. We define $\mathcal{CWR}^{[m]}[\Pi_0](s) = \{\Gamma \in \mathcal{CWR}^{[m]}[\Pi_0] \mid$
the head's size of each rule of $\Gamma$ is not greater than $s\}$. The proof of Theorem 1
is based on [9].

Theorem 1. For any $n \ge 1$, the hypothesis space $\mathcal{WR}^{[\le n]}$ (resp., $\mathcal{SB}^{[\le n]}$) of
all weakly reducing (resp., size-bounded) FGS programs with at most $n$ graph
rewriting rules has infinitely many hypotheses. And $\mathcal{WR}^{[\le n]}$ (resp., $\mathcal{SB}^{[\le n]}$) is
theoretical-term-freely and refutably inferable from complete data.
procedure RIIA_WR(integer n, set of predicate symbols Π_0 ⊆ Π);
begin
    T := ∅; F := ∅;
    read_store(T, F);
    while T = ∅ do begin
        output the empty FGS program;
        read_store(T, F);
    end;
    T_0 := T; F_0 := F;
    for m = 1 to n do begin
        s_m := max{||A|| | A ∈ T_{m-1}};
        recursively generate CWR^[m][Π_0](s_m), and set it to S;
        for each Γ ∈ S do
            while (T, F) is consistent with M_Γ do begin
                output Γ;
                read_store(T, F);
            end;
        T_m := T; F_m := F;
    end;
    output “refute” and stop;
end;

procedure read_store(T, F);
begin
    read the next fact (w, t);
    if t = '+' then T := T ∪ {w} else F := F ∪ {w};
end.

Fig. 3. RIIA_WR: a refutably inductive inference algorithm for the hypothesis space
$\mathcal{WR}^{[\le n]}$ of weakly reducing FGS programs with at most $n$ graph rewriting rules.



Proof. (Sketch of proof) We feed a predicate-restricted complete presentation of
a set $I \subseteq HB$ w.r.t. $\Pi_0$ to the procedure RIIA_WR in Fig. 3. (i) In case there is
an FGS program $\Gamma \in \mathcal{WR}^{[\le n]}$ such that $M_\Gamma|_{\Pi_0} = I|_{\Pi_0}$: it follows by the Size
Non-decreasing Property that a graph rewriting rule whose head has greater size than
a ground atom $A$ is not used to derive the atom $A$. Thus, in the procedure, for
any $0 \le m \le n$, if $T_m$ and $F_m$ are defined, then $M_{\Gamma'}$ is not consistent with
$T_m$ and $F_m$ for any $\Gamma' \in \mathcal{CWR}^{[m]}[\Pi_0]$. Therefore $T_n$ and $F_n$ are never defined
and the procedure never terminates the first or second while-loop. (ii) Otherwise,
for any $1 \le m \le n$, all FGS programs in $\mathcal{CWR}^{[m]}[\Pi_0]$ are discarded.

By simple enumeration of hypotheses, the hypothesis spaces and 

are inferable but not refutably inferable. If the number of graph rewriting 
rules is not bounded by a constant, then these hypothesis spaces are not refutably 
inferable. We can construct a machine discovery system for a refutably inferable 
hypothesis space. Thus Theorem 1 gives a theoretical foundation of KD-FGS. 
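
The control flow of a refutably inductive inference algorithm can be illustrated on a toy scale. The Python sketch below is not the procedure RIIA_WR: it assumes a finite, explicitly listed hypothesis space over integers (the names `TOY_SPACE`, `consistent` and `refutable_inference` are hypothetical), and it omits the size-bounded generation of canonical programs. It only shows the two possible behaviours, namely outputting guesses as facts arrive or emitting "refute" once every hypothesis has been discarded.

```python
def consistent(member, pos, neg):
    # A hypothesis is consistent if it covers all positive and no negative facts.
    return all(member(a) for a in pos) and not any(member(a) for a in neg)

def refutable_inference(hypotheses, presentation):
    """Yield the current guess after each fact, or 'refute' and stop."""
    pos, neg = set(), set()
    candidates = iter(hypotheses)
    current = next(candidates, None)
    for atom, sign in presentation:
        (pos if sign == "+" else neg).add(atom)
        # Discard hypotheses until one agrees with everything seen so far.
        while current is not None and not consistent(current[1], pos, neg):
            current = next(candidates, None)
        if current is None:
            yield "refute"
            return
        yield current[0]

# Hypothetical two-element hypothesis space over the integers.
TOY_SPACE = [
    ("even", lambda k: k % 2 == 0),
    ("multiple_of_3", lambda k: k % 3 == 0),
]

guesses = list(refutable_inference(TOY_SPACE, [(6, "+"), (4, "-"), (9, "+")]))
refuted = list(refutable_inference(TOY_SPACE, [(2, "+"), (3, "+")]))
```

Because discarded hypotheses are never revisited, refutation is sound here only because the toy space is finite and fully enumerated.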
procedure Unification(regular term tree t_1, tree T_2);
begin
    Let r_1 be one of the leaves of t_1;
    Construct the set R_{r_1} of all labeling rules;
    foreach leaf r_2 of T_2 do begin
        Label each leaf of T_2 except r_2 with the set of all leaves of T_1 except r_1;
        while there exists a vertex v of T_2
              such that v is not labeled and all children of v are labeled
            do Labeling(v, R_{r_1});
        if the label of r_2 includes r_1 then t_1 and T_2 are unifiable and exit
    end;
    t_1 and T_2 are not unifiable
end.

Fig. 4. Unification: an algorithm for deciding whether $t_1$ and $T_2$ are unifiable or not.



4 An Efficient Algorithm for Finding a Unifier of a Term 
Tree and a Tree 

In this section, we give a polynomial-time algorithm for finding a unifier of a
term tree and a tree in order to speed up KD-FGS.

A term graph $g = (V, E, H)$ is called a term tree if each variable in $H$ is a list of two
distinct vertices and, for any substitution $\theta = \{x_1 := [g_1, \sigma_1], \ldots, x_n := [g_n, \sigma_n]\}$
such that each term graph $g_i$ is a tree, $g\theta$ is also a tree. A term tree $g$ is called
regular if each variable label in $g$ occurs exactly once [7]. For example, a term
tree $g = (\{r, s, t, u, v, w\}, \{\{r, s\}, \{u, v\}\}, \{(s, t), (s, u), (u, w)\})$ is shown in Fig.
6. As stated in Section 2, in general, there exists no mgu of two regular
term trees. Therefore, even if the input data for KD-FGS is restricted to trees,
a derivation in FGS is based on an enumeration of unifiers and only ground
goals are considered. From a simple observation we can show that the FGS
interpreter must solve the subgraph isomorphism problem, which is NP-complete.
For certain special subclasses of graphs, the subgraph isomorphism problem is
efficiently solvable [10]. But we should note that even if a subclass of graphs
has an efficient algorithm for the subgraph isomorphism problem, we cannot
construct a unification algorithm straightforwardly from that algorithm.

In this section, we assume that a tree which is an input to our unification
algorithm is an unrooted tree without vertex labels and edge labels, since we
can easily construct a unification algorithm for a tree having vertex labels and
edge labels. Let $t_1 = (V_1, E_1, H_1)$ and $T_2 = (V_2, E_2)$ be a regular term tree and
a tree, respectively. Then, we give the algorithm Unification (Fig. 4) for finding
a unifier of a regular term tree and a tree. First we specify one of the leaves of $t_1$. Let
the leaf be $r_1$. We define the rooted tree $T_1$ as $T_1 = (V_1, E_1 \cup \{\{u_1, u_2\} \mid (u_1, u_2) \in
H_1$ or $(u_2, u_1) \in H_1\})$ with the root $r_1$. For a vertex $u \in V_1$, let $w_1, \ldots, w_k$ be
all children of $u$ in $T_1$ such that each $\{u, w_i\}$ is an edge in $E_1$ for $i = 1, \ldots, k$ and
let $w_{k+1}, \ldots, w_m$ be all children of $u$ in $T_1$ such that either $(u, w_i)$ or $(w_i, u)$ is
procedure Labeling(vertex v ∈ V_2, set of labeling rules R);
begin
    L := ∅;
    Let d be the number of children of v and L_1, ..., L_d be labels of the children;
    foreach u ← w_1, ..., w_d in R do begin
        Let E := {{w_i, L_j} | w_i ∈ L_j (1 ≤ i ≤ d, 1 ≤ j ≤ d)};
        if there is a perfect matching
           for the bipartite graph ({w_1, ..., w_d}, {L_1, ..., L_d}, E)
        then L := L ∪ {u}
    end;
    foreach u ⇐ w_1, ..., w_k, (w_{k+1}), ..., (w_m) in R with m ≤ d do begin
        Let E_1 := {{w_i, L_j} | w_i ∈ L_j (1 ≤ i ≤ k, 1 ≤ j ≤ d)},
            E_2 := {{w_i, L_j} | w_i ∈ L_j or (w_i) ∈ L_j (k+1 ≤ i ≤ m, 1 ≤ j ≤ d)};
        if for the bipartite graph ({w_1, ..., w_m}, {L_1, ..., L_d}, E_1 ∪ E_2)
           there is a maximum matching which contains all vertices w_1, ..., w_m
        then L := L ∪ {u}
    end;
    foreach (w) ⇐ (w) in R do begin
        if there is a set among L_1, ..., L_d which includes w or (w) then
            L := L ∪ {(w)}
    end;
    Label v with L
end.

Fig. 5. Labeling: a procedure for labeling a vertex in $T_2$ with a set of vertices in $t_1$.



a variable in $t_1$ for $i = k+1, \ldots, m$. We let $v$ be the parent of $u$ in $T_1$ if $u$ is
not the root of $T_1$. We define labeling rules for $u$ as follows: If there is no variable
which has $u$ as its element, i.e. $k = m$ and both $(v, u)$ and $(u, v)$ are not
variables in $t_1$, then we simply add the following rule to the set of labeling rules:

    $u \leftarrow w_1, w_2, \ldots, w_m$.

If $k = m$ but either $(v, u)$ or $(u, v)$ is a variable in $t_1$, we add the following rule:

    $u \Leftarrow w_1, w_2, \ldots, w_m$.

Otherwise, we add the following rules:

    $u \Leftarrow w_1, \ldots, w_k, (w_{k+1}), \ldots, (w_m)$

and for $k+1 \le i \le m$,

    $(w_i) \Leftarrow (w_i)$.
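
The rule construction just described can be sketched in Python. The encoding is an assumption made for illustration: a rule is a triple (head, kind, body) where kind is "exact" for a rule written with ←, "at_least" for a rule written with ⇐, and "descendant" for a rule of the form (w) ⇐ (w). The sketch also assumes that leaves other than the root receive no rule of their own, since the Unification procedure labels leaves directly.

```python
def labeling_rules(edges, variables, root):
    """Build labeling rules for a regular term tree rooted at `root`.

    edges     -- iterable of pairs (a, b), the edges of the term tree
    variables -- iterable of pairs (a, b), the two-element variables
    """
    via_edge, via_var = {}, {}
    for a, b in edges:
        via_edge.setdefault(a, set()).add(b)
        via_edge.setdefault(b, set()).add(a)
    for a, b in variables:
        via_var.setdefault(a, set()).add(b)
        via_var.setdefault(b, set()).add(a)

    rules, stack = [], [(root, None)]
    while stack:
        u, parent = stack.pop()
        plain = sorted(via_edge.get(u, set()) - {parent})  # children via edges
        boxed = sorted(via_var.get(u, set()) - {parent})   # children via variables
        if not plain and not boxed:
            continue  # assumption: leaves get no rule of their own
        parent_is_var = parent in via_var.get(u, set())
        if not boxed and not parent_is_var:
            rules.append((u, "exact", plain))
        else:
            rules.append((u, "at_least", plain + [f"({w})" for w in boxed]))
            rules += [(f"({w})", "descendant", [f"({w})"]) for w in boxed]
        stack += [(w, u) for w in plain + boxed]
    return rules

# The term tree g of Fig. 6, rooted at its leaf r.
rules = labeling_rules(
    edges=[("r", "s"), ("u", "v")],
    variables=[("s", "t"), ("s", "u"), ("u", "w")],
    root="r",
)
```

On this input the sketch produces one exact rule for r, at-least rules for s and u, and descendant rules for (t), (u) and (w).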

Let $R_{r_1}$ be the set of all labeling rules obtained by applying the above process
to all vertices in $V_1$. We specify one of the leaves $r_2$ of $T_2$ and consider $T_2$ as the
rooted tree with root $r_2$. Then, we label all vertices of $T_2$ with sets of vertices of
$T_1$ using the procedure Labeling (Fig. 5). First we label each leaf of $T_2$ except
$r_2$ with the set of all leaves of $T_1$ except $r_1$. For each vertex $v$ in $T_2$ such that
$v$ itself is not labeled yet but all children of $v$ have already been labeled, we
repeat the procedure Labeling until $r_2$ is labeled. After the procedure Labeling
for a vertex $v \in V_2$ terminates, if $v$ has $u$ as an element of the label of $v$, it
shows that $v$ possibly corresponds to $u$. If $v$ has $(u)$ as an element of the label
[Figure: a term tree g, the labeling rules R_r constructed from g, and a tree T whose leaves are labeled {t, v, w}.]

Labeling rules R_r:
    r ← s
    s ⇐ (t), (u)
    u ⇐ v, (w)
    (t) ⇐ (t)
    (u) ⇐ (u)
    (w) ⇐ (w)

Fig. 6. An example: the labeling rules constructed from a term tree g and the labels of
a tree T after the Unification algorithm terminates.



of $v$, it shows that $v$ possibly corresponds to $u$ or $v$ has a descendant which
possibly corresponds to $u$. In the procedure Labeling, a labeling rule of the form
$u \leftarrow w_1, w_2, \ldots, w_m$ can be applied to $v \in V_2$ only when $v$ has exactly $m$ children
$c_1, c_2, \ldots, c_m$ such that each child $c_i$ has $w_{\ell_i}$ as an element of the label of $c_i$ for
$i = 1, \ldots, m$, where $\{w_{\ell_1}, \ldots, w_{\ell_m}\} = \{w_1, w_2, \ldots, w_m\}$. On the other hand,
$u \Leftarrow w_1, \ldots, w_k, (w_{k+1}), \ldots, (w_m)$ can be applied to $v \in V_2$ when $v$ has at least
$m$ children $c_1, \ldots, c_k, c_{k+1}, \ldots, c_m$ such that for $i = 1, \ldots, k$, $c_i$ has $w_{\ell_i}$ as an
element of the label of $c_i$ where $\{w_{\ell_1}, \ldots, w_{\ell_k}\} = \{w_1, w_2, \ldots, w_k\}$ and for
$i = k+1, \ldots, m$, $c_i$ has $w_{\ell_i}$ or $(w_{\ell_i})$ as an element of the label of $c_i$ where
$\{w_{\ell_{k+1}}, \ldots, w_{\ell_m}\} = \{w_{k+1}, \ldots, w_m\}$. The rules of the form $(w) \Leftarrow (w)$ are used
to define the descendant relation. If $r_2$ is labeled with a set including $r_1$, the
Unification algorithm reports the fact that there is a unifier of $t_1$ and $T_2$, and
terminates. Otherwise, the Unification algorithm applies the above process to
the other leaves of $T_2$. In Fig. 6, for example, we give the labeling rules $R_r$
constructed in the Unification algorithm and show the label assigned to each
vertex of $T$ in the Labeling procedure when the term tree $g$ and the tree $T$ shown
in Fig. 6 are given as inputs.
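
The applicability test for a single labeling rule reduces to a bipartite matching between the symbols in the rule body and the labels of the children. A minimal Python sketch follows, using a simple augmenting-path matching rather than any particular algorithm from the paper. The function name `rule_applies`, the "exact"/"at_least" rule kinds, and the convention that a boxed symbol is written as a string such as "(w)" are assumptions made for illustration.

```python
def rule_applies(kind, body, child_labels):
    """Check whether a rule body can be matched against the children's labels.

    kind         -- "exact" (rule written with <-) or "at_least" (rule written with <=)
    body         -- list of symbols such as "v" or "(w)"
    child_labels -- list of sets, one label set per child
    """
    if kind == "exact" and len(body) != len(child_labels):
        return False
    # A plain symbol w needs w itself; a boxed symbol "(w)" accepts w or "(w)".
    accept = [{s, s[1:-1]} if s.startswith("(") else {s} for s in body]
    match = {}  # child index -> index of the body symbol assigned to it

    def assign(i, visited):
        # Try to give body symbol i a child, re-assigning earlier symbols if needed.
        for j, label in enumerate(child_labels):
            if j not in visited and accept[i] & set(label):
                visited.add(j)
                if j not in match or assign(match[j], visited):
                    match[j] = i
                    return True
        return False

    # Every body symbol must receive its own distinct child.
    return all(assign(i, set()) for i in range(len(body)))
```

For instance, the body `["v", "(w)"]` can be matched against two children both labeled `{t, v, w}`, but not against children labeled `{t}` and `{v}`.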

If the algorithm declares that $t_1$ and $T_2$ are unifiable, we can easily find a
unifier from the labels of $T_2$. Since the number of vertices contained in each label is
$O(|V_1|)$, we show the following theorem:
Table 1. Experimental results on the KD-FGS system.

No.  Examples         Hypothesis Space                         Result
1    TTSP graph       weakly reducing, #atom ≤ 2, #rule ≤ 2    refute
2    TTSP graph       weakly reducing, #atom ≤ 2, #rule ≤ 3    infer
3    TTSP graph       size-bounded, #atom ≤ 6, #rule ≤ 2       refute
4    TTSP graph       size-bounded, #atom ≤ 6, #rule ≤ 3       refute
5    undirected tree  weakly reducing, #atom ≤ 1, #rule ≤ 2    refute
6    undirected tree  weakly reducing, #atom ≤ 1, #rule ≤ 3    infer
7    undirected tree  size-bounded, #atom ≤ 6, #rule ≤ 2       refute
8    undirected tree  size-bounded, #atom ≤ 6, #rule ≤ 3       infer

Theorem 2. A unifier of a regular term tree and a tree can be found in polynomial
time.

5 Experimental Results: Obtaining Some New Knowledge 
about Graph Theoretical Notions 

In order to show that the KD-FGS system is useful for knowledge discovery
from graph data, we have carried out preliminary experiments with the system (see
Table 1). We give the system examples of graph-theoretical notions and obtain
some new knowledge about representability in FGS programs. For example, in
Exp. 2 and 4, input data are positive and negative examples of TTSP graphs
(see Fig. 1). In Exp. 2 (resp., 4), the hypothesis space $C_2$ (resp., $C_4$) is the set of
all restricted weakly reducing (resp., size-bounded) FGS programs with at most
2 (resp., 6) atoms in each body and at most 3 (resp., 3) rules in each program.
After the system receives some positive and negative examples, it infers a correct
FGS program in $C_2$ for TTSP graphs in Exp. 2 (resp., it refutes $C_4$ in Exp. 4). No
one knows whether there exists a size-bounded FGS program for TTSP graphs,
so the experiment of finding such an FGS program is of interest. The
results of inferring an FGS program or refuting a hypothesis space constitute new
knowledge about graph-theoretical notions. Thus, we confirm that the system is
useful for knowledge discovery from graph data.

6 Concluding Remarks 

We have given a logical foundation for discovering new knowledge from graph
data by employing a refutably inductive inference algorithm, which is one of the
ILP methods. We have also presented a polynomial-time algorithm for finding
a unifier of a term tree and a tree. This algorithm enables the discovery of new
knowledge from “real-world” data having tree structures.

In order to apply our system to huge “real-world” data, we must achieve a
practical speedup of the KD-FGS system. We are implementing another FGS
interpreter, which is based on a bottom-up theorem proving method, in the parallel
logic programming language KLIC.
Abstract. Inductive Logic Programming (ILP) involves constructing an
hypothesis $H$ on the basis of background knowledge $B$ and training
examples $E$. An independent test set is used to evaluate the accuracy of $H$.
This paper concerns an alternative approach called Analogical Prediction
(AP). AP takes $B$, $E$ and then for each test example $\langle x, y \rangle$ forms an
hypothesis $H_x$ from $B$, $E$, $x$. Evaluation of AP is based on estimating the
probability that $H_x(x) = y$ for a randomly chosen $\langle x, y \rangle$. AP has been
implemented within CProgol4.4. Experiments in the paper show that on
English past tense data AP has significantly higher predictive accuracy
than both previously reported results and CProgol in inductive
mode. However, on KRK illegal AP does not outperform CProgol
in inductive mode. We conjecture that AP has advantages for domains
in which a large proportion of the examples must be treated as
exceptions with respect to the hypothesis vocabulary. The relationship of AP
to analogy and instance-based learning is discussed. Limitations of the
given implementation of AP are discussed and improvements suggested.



1 Introduction 

1.1 Analogical prediction (AP) 

Suppose that you are trying to make taxonomic predictions about animals. You 
might already have seen various animals and know some of their properties. 
Now you meet a platypus. You could try and predict whether the platypus was 
a mammal, fish, reptile or bird by forming analogies between the platypus and 
other animals for which you already know the classifications. Thus you could 
reason that a platypus is like other mammals since it suckles its young. In doing 
so you are making an assumption which could be represented as the following 
clause. 

class(A,mammal) :- has_milk(A).
Fig. 1. Comparison of AP, ILP and IBL. Instances $x$ are pairs $\langle u, v \rangle$ from $\{1, \ldots, 7\} \times
\{1, \ldots, 8\}$. Each $x$ can have a classification $y \in \{+, -\}$. The test instance $x$ to be classified
is marked in each panel. The rounded box in each case defines the extension of the hypothesis
$H$. Below each box the corresponding prediction for $x$ is given.



It might be difficult to find a consistent assumption similar to the above which 
allowed a platypus to be predicted as being a fish or a reptile. However, you 
could reason that a platypus is similar to various birds you have encountered 
since it is both warm blooded and lays eggs. Again this would be represented as 
follows. 

class(A,bird) :- homeothermic(A), has_eggs(A).

Note that the hypotheses above are related to a particular test instance, the 
platypus, for which the class value (mammal, bird, etc.) is to be predicted. We 
will call this form of reasoning Analogical Prediction (AP). 



1.2 AP, induction and instance-based learning 

In the above, AP is given a test instance $x$, a training set $E$ and background
knowledge $B$. It then constructs an hypothesis $H_x$ which not only covers some
of the training set but also predicts the class $y$ of $x$. This can be contrasted with
the normal semantics of ILP [10], in which hypotheses $H$ are constructed on the
basis of $B$ and $E$ alone. In this case $x$ is presented as part of the test procedure
after $H$ has been constructed.

AP is in some ways more similar to Instance-Based Learning (IBL) (see for
example [3]), in which the class $y$ would be attributed to $x$ on the basis of its
proximity to various elements of $E$. However, in the case of IBL, instead of
constructing $H$, a similarity measure is used to determine proximity.

Figure 1 illustrates the differences in prediction between AP, standard ILP
and IBL on a 2D binary classification problem. The test concept, ‘insign’, is
actually a picture of the symbol ‘h’ made out of +’s. If AP is restricted to
making a single-clause maximally general hypothesis, it would predict $x$ to be
positive based on the following.

insign(U,V) :- 3=<U=<6, 4=<V=<5.

Assuming the closed world assumption is used for prediction, normal ILP will
predict $x$ to be negative based on the following hypothesis (note the exception
at (5,4)).

insign(U,V) :- 3=<U=<4, 2=<V=<7.

Finally IBL will predict $x$ to be negative based on the fact that 5/6 of the
surrounding instances are negative. Note that IBL’s implicit hypothesis in this
case could be denoted by the following denial.

:- insign(U,V), near(U,V,6,5).

The background predicate near/4 in the above encodes the notion of ‘nearness’
used in a k-nearest neighbour type algorithm.



1.3 Motivation 

AP can be viewed as a half-way house between IBL and ILP. IBL has a number
of advantages over ILP. These include ease of updating the knowledge-base and
the fact that theory revision is unnecessary after the addition of new examples.
AP shares both these advantages with IBL. On the other hand IBL has a number
of disadvantages with respect to ILP. Notably, IBL predictions lack explanation,
and there is a need to define a metric to describe similarity between instances.
Generally, similarity metrics are hard to justify, even when they can be shown
to have desirable properties (e.g. [5, 11, 12]). In comparison AP predictions are
directly associated with an explicit hypothesis, which provides explanation. Also
AP does not require a similarity measure since predictions are made on the basis
of the hypothesis.

This paper has the following structure. A formal framework for AP is provided
in Section 2. Section 3 describes an implementation of AP within CProgol4.4
(ftp://ftp.cs.york.ac.uk/pub/ML_GROUP/progol4.4). This implementation
is restricted to the special case of binary classification. Experiments using
this implementation of AP are described in Section 4. On the standard English
past tense data set [7] AP has higher predictive accuracy than FOIDL,
FOIL and CProgol in inductive mode. By contrast, on KRK illegal AP performs
slightly worse than CProgol in inductive mode. In the discussion (Section 5)
we conjecture that AP has advantages for domains in which a large proportion
of the examples must be treated as exceptions with respect to the hypothesis
vocabulary. We also compare AP to analogical reasoning. The results are
summarised in Section 6, and further improvements in the existing implementation
are suggested.
Aleave(B, E)
    Let AP = Ap = aP = ap = 0
    For each e = ⟨x, y⟩ in E
        (a) Construct ⊥_{B,x}
            E' = E \ e
        (b) Using E' find most compressive H_x ⪰ ⊥_{B,x}
        if y = H_x(x) then
            if y = True then AP := AP + 1
            else ap := ap + 1
        else
            if y = True then Ap := Ap + 1
            else aP := aP + 1
    Print-contingency-table(AP, Ap, aP, ap)

Fig. 2. Aleave algorithm. Algorithms from CProgol4.1 are used to (a) construct the
bottom clause and (b) search the refinement graph.



2 Definitions 

We assume denumerable sets $X$, $Y$ representing the instance and prediction
spaces respectively and a probability distribution $\mathcal{D}$ on $X$. The target theory
is a function $f : X \to Y$. An AP learning algorithm $L$ takes background
knowledge $B$ together with a set of training examples $E \subseteq \{\langle x', f(x') \rangle : x' \in X\}$. For
any given $B$, $E$ and test instance $x \in X$ the output of $L$ is an hypothesised
function $H_x$. Error is now defined as follows.

    $error(L, B, E) = \Pr_{x \in \mathcal{D}}[H_x(x) \neq f(x)]$    (1)



3 Implementation 

AP has been implemented as a built-in predicate aleave in CProgol4.4 (available
from ftp://ftp.cs.york.ac.uk/pub/ML_GROUP/progol4.4). The algorithm,
shown in Figure 2, carries out a leave-one-out procedure which estimates AP
error as defined in Equation (1). In terms of Section 2 each left-out example $e$
is viewed as a pair $\langle x, y \rangle$ where $x$ is a ground atom and $y$ = True if $e$ is positive
and $y$ = False if $e$ is negative.

AP error (see Equation 1) is estimated using the counters AP, Ap, aP and ap
(‘a’ and ‘p’ stand for actual and predicted, and capitalisation/non-capitalisation
stands for the value being True/False). For each example $e = \langle x, y \rangle$ left out, a
bottom clause $\bot_{B,x}$ is constructed which predicts $y$ := True. A refinement graph
search of the type described in [8] is carried out to find a maximally compressive
single clause $H_x$ which subsumes $\bot_{B,x}$. In doing so compression is computed
relative to $E \setminus e$. If no compressive clause is found then the prediction is False.
Otherwise it is True.
English past tense:
    past([w,o,r,r,y],[w,o,r,r,i,e,d]).
    past([c,l,u,t,c,h],[c,l,u,t,c,h,e,d]).
    past([w,h,i,z],[w,h,i,z,z,e,d]).
    past([g,r,i,n,d],[g,r,o,u,n,d]).

KRK illegality:
    illegal(3,5,6,7,6,2).
    illegal(3,6,7,6,7,4).
    :- illegal(2,5,5,2,4,1).
    :- illegal(5,7,1,2,0,0).

Fig. 3. Form of examples for both domains

English past tense:
    past(A,B) :- split(A,C,[r,r,y]), split(B,C,[r,r,i,e,d]).

KRK illegality:
    illegal(A,B,A,B,_,_).
    illegal(A,B,_,_,C,D) :- adj(A,C), adj(B,D).
    illegal(A,_,B,_,B,_) :- not A=B.
    illegal(_,A,B,C,B,D) :- A<C, A<D.

Fig. 4. Form of hypothesised clauses

The procedure Print-contingency-table(AP,Ap,aP,ap) prints a two-by-two table
of the 4 values together with the accuracy estimate, standard error of estimation
and probability.
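
The leave-one-out bookkeeping can be sketched in Python as follows. This is not CProgol's aleave: the clause search is replaced by a caller-supplied `predict` function, the toy `parity_predict` learner and its data are hypothetical, and the standard error is assumed here to be the usual binomial estimate $\sqrt{p(1-p)/n}$, which the text does not spell out.

```python
import math

def aleave(examples, predict):
    """Leave-one-out estimate; examples are (x, y) pairs with y in {True, False}."""
    counts = {"AP": 0, "Ap": 0, "aP": 0, "ap": 0}
    for i, (x, y) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]   # E \ e
        guess = predict(rest, x)                 # stands in for H_x(x)
        # 'A'/'a' = actual True/False, 'P'/'p' = predicted True/False.
        counts[("A" if y else "a") + ("P" if guess else "p")] += 1
    n = len(examples)
    accuracy = (counts["AP"] + counts["ap"]) / n
    std_err = math.sqrt(accuracy * (1 - accuracy) / n)  # assumed binomial estimate
    return counts, accuracy, std_err

def parity_predict(train, x):
    # Toy stand-in for "a compressive clause exists": predict True when some
    # training example shares x's parity and all such examples are positive.
    same = [y for z, y in train if z % 2 == x % 2]
    return bool(same) and all(same)

data = [(2, True), (4, True), (6, True), (7, True), (3, False), (5, False)]
counts, accuracy, std_err = aleave(data, parity_predict)
```

In this toy run the only error is the positive example 7, which is predicted False because the remaining odd examples are negative.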

4 Experiments 

The experiments were aimed at determining whether AP could provide increased 
predictive accuracy over other ILP algorithms. Two standard ILP data sets were 
chosen for comparison (described in Section 4.2 below). 



4.1 Experimental hypotheses 

The following null hypotheses were tested in the first and second experiments 
respectively. 

Null hypothesis 1. AP does not have higher predictive accuracy than any 
other ILP system on any standard data set. 

Null hypothesis 2. AP has higher predictive accuracy than any other ILP 
system on all standard data sets. 

Note that hypothesis 1 is not the negation of hypothesis 2. If both are rejected 
then it means simply that AP is better for some domains but not others. 



4.2 Materials 

The following data sets were used for testing the experimental hypotheses. 

English past tense. This is described in [15, 6, 7]. The available example set
$E_{past}$ has size 1390.
KRK illegality. This was originally described in [9]. The total instance space
size is $8^6 = 262144$.

For both domains the form of examples is shown in Figure 3 and the form
of hypothesised clauses in Figure 4. Note that in the KRK illegality domain
negative examples are those preceded by a ‘:-’ while the English past tense
domain has no negative examples. The absence of negative examples in the
English past tense domain is compensated for by a constraint on hypothesised
clauses which Mooney and Califf [7] call output completeness. In the experiments
output completeness is enforced in CProgol4.4 by including the following
user-defined constraint which requires that past/2 is a function.

:- hypothesis(past(X,Y),Body,_), clause(past(X,Z),true), Body, not(Y==Z).

4.3 Method 

English past tense. Mooney and Califf [7] compared the predictive accuracy of
FOIDL, IFOIL and FOIL on the alphabetic English past tense task. We interpret
the description of their training regime as follows. Training sets of sizes 25, 50,
100, 250 and 500 were randomly sampled without replacement from $E_{past}$, with
10 sets of each size. For each training set E a test set of size 500 was randomly
sampled without replacement from $E_{past} \setminus E$. Each learning system was applied
to each training set and the predictive accuracy assessed on the correspond- 
ing test set. Results were averaged for each training set size and reported as a 
learning curve. 
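
The sampling regime just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by the authors or by Mooney and Califf: the function name `make_splits`, the fixed seed, and the use of the integers 0..1389 as stand-ins for the 1390 past tense examples are all hypothetical.

```python
import random

def make_splits(examples, sizes=(25, 50, 100, 250, 500),
                repeats=10, test_size=500, seed=0):
    """Return (size, train, test) triples; each test set avoids its training set."""
    rng = random.Random(seed)  # fixed seed only so that the sketch is repeatable
    splits = []
    for size in sizes:
        for _ in range(repeats):
            train = rng.sample(examples, size)        # sampled without replacement
            held_out = set(train)
            remaining = [e for e in examples if e not in held_out]
            test = rng.sample(remaining, test_size)   # disjoint from the training set
            splits.append((size, train, test))
    return splits

# Hypothetical stand-in for the example set: 1390 distinct items.
splits = make_splits(list(range(1390)))
```

Averaging a system's accuracy over the 10 triples that share a training set size gives one point of its learning curve.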

For the purposes of comparison we followed the above training regime for 
CProgol4.4 in inductive mode. We also ran AP with the aleave predicate built- 
in to CProgol4.4 (Section 3) on each of the training sets and then for each 
training set size averaged the results over the 10 sets. 

KRK illegality. Predictive accuracies were compared for CProgol4.4 using AP
with leave-one-out (aleave) against induction with leave-one-out (leave). Training 
sets of sizes 5, 10, 20, 30, 40, 60, 80, 100, 200 were randomly sampled with 
replacement from the total example space, with 10 sets of each size. For each 
of the training sets both aleave and leave were run. The resulting predictive 
accuracies were averaged for each training set size. 

4.4 Results 

English past tense. The five learning curves are shown in Figure 5. The
horizontal line labelled “Default rules” represents the following simple Prolog
program which adds a ‘d’ to verbs ending in ‘e’ and otherwise adds ‘ed’.

past(A,B) :- split(A,_,[e]), split(B,A,[d]), !.
past(A,B) :- split(B,A,[e,d]).
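
For readers less familiar with Prolog, the same default rule can be written as a short Python function. This is an illustrative restatement of the rule as the text describes it; the name `default_past` is an assumption made here and is not part of CProgol.

```python
def default_past(verb):
    """Add 'd' to verbs ending in 'e', otherwise add 'ed'."""
    return verb + "d" if verb.endswith("e") else verb + "ed"

print(default_past("move"))   # moved
print(default_past("walk"))   # walked
print(default_past("grind"))  # grinded, whereas the data set lists "ground"
```

The last call shows why irregular verbs count as exceptions to the default rules: the rule produces a regular form where the example set records an irregular one.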

The differences between AP and all other systems are significant at the 0.0001 
level with 250 and 500 examples. Thus null hypothesis 1 is clearly rejected. 




Fig. 5. Learning curves for alphabetic English past tense. 



KRK illegality. The two learning curves are shown in Figure 6. The horizontal
line labelled “Majority class” shows the percentage of negative examples in the 
domain. Only the accuracy difference between induction and AP for 200 exam- 
ples is significant at the 0.05 level, though taken together the differences are 
significant at the 0.0001 level. Thus null hypothesis 2 can be rejected. 

5 Discussion and related work 

The strong rejection of the two null hypotheses indicates that the advantages of
AP relative to induction are domain dependent. The authors believe that AP has 
advantages for domains, like the English past tense, in which a large proportion 
of the examples must be treated as exceptions with respect to the hypothesis 
vocabulary. Note that KRK illegal contains exceptions, though they fall into a 
relatively small number of classes, and have relatively low frequency (a 2 clause 
approximation of KRK illegal has over 90% accuracy). By contrast, around 20% 
of the verbs in the past tense data are irregular. 

It should be noted that our implementation of AP has a tendency to over-
generalise. This stems from the asymmetry in constructing only clauses which
make positive predictions in the aleave algorithm (Section 3). The tendency to 
overgeneralise decreases accuracy in the KRK illegal domain but increases accu- 
racy in the past tense domain, due to the lack of negative examples. Even when 
negative examples are added to the past tense training set, predictive accuracy 
is unaffected due to the output completeness constraint. 




Fig. 6. Learning curves for KRK illegality. 



The AP accuracies on English past tense data shown in Figure 5 are the 
highest on this data set in the literature. However, it is interesting to note 
that CProgol’s induction mode results are as good as FOIDL. This contradicts 
Mooney and Califf’s claim that FOIDL’s decision list representation gives FOIDL 
strong advantages in this domain. 



5.1 Relationship of AP to analogy 

Evans’ [4] early studies of analogy concentrated on IQ tests of the form shown 
in Figure 7. Evans was the first to implement a program for solving geometric
analogy problems. These are problems of the form “A is to B as C is to ?” 
where the answer is one of five possible solutions, i.e. a multiple-choice format. 
The problems solved by his program were taken from actual high-school level 
test papers. The program, called ANALOGY, comprised two parts. Part 1 is 
given two line drawings A and B as input and calculates a set of properties, 
relations and “similarities” such as rotations and reflections which take A into 
B and relate C to each of the five possible answers. Part 2 forms a set of theories 
or transformation rules taking A into B. It then attempts to generalize these 
theories to cover additional data (C and the answer figures). This results in a 
subset of the admissible theories, i.e. transformation rules which take A into B 
and C into exactly one answer figure. Finally, the program chooses the most 
specific theory from these admissible theories. 

Evans notes that the program does very little search, which indicates that 
the hypothesis space of ANALOGY is highly constrained. The solution is chosen 



Fig. 7. IQ test problem of the type studied by Evans. 



on the basis of a specificity bias. Although Part 1 of the program includes a large 
amount of domain-specific knowledge, Evans is careful to point out that Part 2 
is a general purpose method for finding analogies of this form. 

Analogical reasoning is often viewed as having a close relationship to induc-
tion. For instance, both Peirce and Polya suggested that analogical conclusions
could be deduced via an inductive hypothesis [13, 14]. Similar views are expressed
in the Artificial Intelligence literature [2]. For instance, Arima [1] formalises the
problem of analogy as involving a comparison of a base B and target T. When
B and T are found to share a similarity property S analogical reasoning pre-
dicts that they will also share a projected property P. This is formalised in the
following analogical inference rule.

$$\frac{P(B) \qquad S(T) \wedge S(B)}{P(T)}$$

The rule above can be viewed as involving the construction of the following
inductive hypothesis.

$$\forall x.\; S(x) \rightarrow P(x)$$

From this P(T) can be inferred deductively. Note that S and P can obviously be
extended to take more than one argument. For instance, given the Evans-type
problem in Figure 7 we might formulate the following hypothesis as a Prolog
clause.

is_to(X,Y) :- invertall(X,Y).

In this way we can view analogical reasoning as a special case of AP, in which 
the example set contains a single base example and the test instance relates to 
the target. According to Arima the following issues are seen as being central in 
the discussion of analogy. 

1. what object (or case) should be selected as a base with respect to a target,

2. which property is significant in analogy among properties shared by two
objects and 

3. what property is to be projected w.r.t. a certain similarity. 

These three issues are handled in the CProgol4.4 AP implementation as follows. 

1. A set of base cases is used from the example set based on maximising com- 
pression over the hypothesis space, 

2. relevant properties are found by constructing the bottom clause relative to 
the test instance and 

3. the relevant projection properties are decided on the basis of modeh decla- 
rations given to CProgol. 

6 Conclusions and further work 

In this paper we have introduced the notion of AP as a half-way house between 
induction and instance-based learning. An implementation of AP has been
incorporated into CProgol4.4
(ftp://ftp.cs.york.ac.uk/pub/ML_GROUP/progol4.4). In experiments AP produced
the best predictive accuracy results to date on the English past tense data,
outstripping FOIDL by around 10% after 500 examples. However, on KRK illegal 
AP performs consistently worse than CProgol4.4 in inductive mode. We believe 
that AP works best with domains in which a large proportion of the examples 
must be treated as exceptions with respect to the hypothesis vocabulary. 

The present implementation of AP is limited in a number of ways. For in-
stance, for any test instance x predictions must be binary, i.e. $y \in \{True, False\}$.
Also, because no constructed hypotheses are ever stored, AP cannot deal with 
recursion. It is envisaged that a strategy which mixed both induction and AP 
might work better than either. Thus some, or all, of the AP hypotheses could 
be stored for later use. However, it is not yet clear to the authors which strategy 
would operate best. 

Abstract. Inductive Logic Programming considers almost exclusively
universally quantified theories. To add expressiveness we should consi- 
der general prenex conjunctive normal forms (PCNF) with existential 
variables. ILP mostly uses learning with refinement operators. To ex- 
tend refinement operators to PCNF, we should first extend substitutions 
to PCNF. If one substitutes an existential variable in a formula, one of-
ten obtains a specialization rather than a generalization. In this article
we define substitutions to specialize a given PCNF and a weakly com- 
plete downward refinement operator. Based on this operator, we have 
implemented a simple learning system PCL on some type of PCNF. 



1 Introduction 

Inductive Logic Programming learns a correct logic formula with respect to
examples. The definition of correctness has to do with the way examples are
presented. Suppose some interpretations are given as examples; then formulas
with these examples as models should be found. A downward refinement operator
$\rho$ can be used to search for such a formula. If a searching process begins
with $\top = \{\Box\}$, it should be replaced by the set of its refinements in
$\rho(\top)$ because $\Box$ has no models. A refinement $\phi \in \rho(\top)$ may
have to be replaced by its refinements again because some given interpretations
are not its models. This process can go on until correct formulas are found.

Refinement operators have often been used to learn a correct
universally quantified theory incrementally. If clause $C$ subsumes clause $D$, then
a refinement chain exists from $C$ to $D$ using elementary substitutions and adding
literals. Let $C = p(x,y)$ and $D = p(x,x) \vee \neg q(f(x))$. Then a chain may be
$p(x,y)$, $p(x,y) \vee \neg q(z)$, $p(x,y) \vee \neg q(f(u))$, $p(x,x) \vee \neg q(f(u))$, $p(x,x) \vee \neg q(f(x))$.
If a correct universally quantified theory does exist, then a refinement chain from
$\Box$ to every clause in this theory exists because $\Box$ subsumes every clause.

Until now almost all ILP researchers are only interested in universally quan- 
tified clauses, especially definite program clauses. However, we may want to 
learn a concept expressed by a formula with existential variables such as a 
prenex conjunctive normal form (PCNF). To solve such problems, one considers
the universally quantified Skolem standard form $\psi$ of $\phi$. It is well known
that $\psi \models \phi$ but often $\phi \not\models \psi$. For instance, let the 3-ary predicate $p$ be
interpreted in the set of real numbers $\mathbb{R}$ as: $p(x,y,z)$ is true iff $xy = z$. The
concept that for an arbitrary $z \in \mathbb{R}$ there are $x, y$ such that $xy = z^2$ can
be expressed by $\phi = \forall z \exists x \exists y\, p(x,y,z^2)$. A standard form of the formula $\phi$ is
$\psi = \forall z\, p(f(z),g(z),z^2)$ for new function symbols $f, g$. A model of $\phi$ is also a
model of $\psi$ only when $f$ and $g$ are interpreted in certain ways, e.g. $f(z) = 2z$
and $g(z) = z/2$, but we would like to check the truth value of $\phi$ directly. In a
database we may have an integrity constraint $\forall x \forall y \exists z\,(\neg sell(x,y) \vee supply(x,y,z))$:
if a shop $x$ sells an item $y$, there must be a company $z$ which supplies $x$ with $y$.
Of course we can define one particular supplier as $f(x,y)$ but we have to change
$f$ when we consider another supplier. To add expressiveness we should consider
learning PCNF in general.

We would also like to use (elementary) substitutions to refine a PCNF. The 
usual substitutions often generalize a formula instead of specializing it because 
of the existential variables. Let $\phi = \forall x \exists y\, p(x,y)$ and $\psi = \forall x\, p(x,x)$. Then $\psi \models \phi$
but $\phi \not\models \psi$. Therefore we will define a new type of substitutions to specialize a
PCNF. Based on our substitutions and adding literals, we can define a downward
refinement operator $\rho$ which is weakly complete: for every $\phi$ there is a refinement
chain from $\top$ to $\phi$, i.e., there is an $n$ such that $\phi \in \rho^n(\top)$.

Generalizing and specializing a formula with existential quantifiers have also
been considered in earlier work. A formula there involves only one clause (actually
a clause presented in a special form). The variables in the head of a clause are
universally quantified. The variables not in the head are quantified separately in
the body by existential and universal quantifiers. Some rules are given to
manipulate the variables only in the body. It seems that the rules are motivated by
the following principle: if the body is generalized, then the formula is specialized
and vice versa. That work adopts neither PCNF in general nor a uniform approach
with substitutions.

In this article we begin with establishing some properties of PCNF. We then
define (elementary) substitutions and a downward refinement operator. We use
a refinement operator to search in the set of theories expressed in PCNF. This
differs from a classic refinement operator which is defined in a search space of
universally quantified clauses. Most proofs will be omitted. We then briefly
explain our first step in implementation. If $I_1, I_2, \ldots, I_n$ is a set of
interpretations, the system PCL finds a PCNF $\phi$ such that every $I_j$ is a model
of $\phi$. Hence we have generalized the Claudien system, which only deals
with the standard forms.



2 Prenex Conjunctive Normal Forms 

In the first subsection we give some well known basic definitions and results in
logic (see [NW97]). In the second subsection we state two important lemmas of
ours. In the last subsection we consider the effects of adding literals to a clause 
in a PCNF. 

2.1 Preliminaries

Definition 1. An interpretation $I$ with domain $D$ of a first-order language $L$
consists of the following: (a) Each n-ary function symbol $f$ in $L$ is assigned a
mapping from $D^n$ to $D$. (b) Each n-ary predicate symbol $p$ in $L$ is assigned a
mapping from $D^n$ to {true, false}.

Definition 2. Let $I$ be an interpretation of the first-order language $L$ with
domain $D$. Let $V$ be a set of variables, then a mapping $\theta : V \rightarrow D$ is called a
variable assignment from $V$. Given a variable assignment $\theta$ from the set of
variables in a formula $\phi$, we can check if $\phi$ is true or false under $I$ and $\theta$. If $\phi$ is
a closed formula, then the truth value of $\phi$ is independent of the variable
assignment we choose. $I$ is a model of the closed formula $\phi$ if $\phi$ is true in $I$. Formula
$\psi$ logically implies formula $\phi$, denoted by $\psi \models \phi$, if every model of $\psi$ is also a
model of $\phi$. $\psi$ and $\phi$ are called logically equivalent, denoted by $\psi \Leftrightarrow \phi$, if they
have the same models.

Definition 3. A clause is a disjunction of a finite number of literals. A prenex
conjunctive normal form (PCNF) $\psi$ is a closed formula
$q_1 x_1 q_2 x_2 \ldots q_n x_n (C_1 \wedge C_2 \wedge \ldots \wedge C_m)$ where $q_i$ is a quantifier ($\exists$ or $\forall$),
$x_i \neq x_j$ if $i \neq j$ and $C_j$ is a clause. $q_1 x_1 q_2 x_2 \ldots q_n x_n$ is the prenex of $\psi$ and
$C_1 \wedge C_2 \wedge \ldots \wedge C_m$ is the matrix of $\psi$. We often denote $\psi$ by
$q_1 x_1 q_2 x_2 \ldots q_n x_n M(x_1, \ldots, x_n)$ or $Q(x_1, \ldots, x_n) M(x_1, \ldots, x_n)$
or $Q(\bar{x})M(\bar{x})$ or $Q(\psi)M(\psi)$ or $QM$. We call variables in $Q(\psi)$ universal or
existential, depending on how they are bound; the sets of existential and
universal variables are denoted by $eVar(\psi)$ and $uVar(\psi)$, respectively. We have
$Var(\psi) = uVar(\psi) \cup eVar(\psi)$.

Note that if $\{y_1, \ldots, y_m\} \supseteq \{x_1, \ldots, x_n\}$, then $Q(\bar{y})M(\bar{x}) \Leftrightarrow Q(\bar{x})M(\bar{x})$.

Theorem 1. Let $\phi$ be a closed formula. Then there exists a PCNF $\psi$ such that
$\phi$ and $\psi$ are logically equivalent.

2.2 Some Properties of PCNF 

We often need to check if some interpretation $I$ is a model of a formula $\psi$. Lemma
1 in this subsection gives a necessary and sufficient condition for $I$ to be a model
of $\psi$. Lemma 2 shows that the truth value of a formula often has to do with the
positions of variables in the prenex. Both lemmas will often be used for proving
other results. We will often use variable assignments in the same way as
usual substitutions.

Example 1. Let $\psi = \exists x \forall y\, p(x, f(y))$ and $\phi = \forall y \exists x\, p(x, f(y))$. Then $\psi \models \phi$: Let
$I$ be an interpretation with domain $D$. Then $\psi$ is true in $I$ iff there is a $d \in D$
such that the formula $\forall y\, p(d, f(y))$ is true in $I$ iff there is $d \in D$ such that for
every $e \in D$, $p(d, f(e))$ is true. On the other hand, $\phi$ is true in $I$ iff for every
$e \in D$, $\exists x\, p(x, f(e))$ is true iff for every $e \in D$, there is $d \in D$ such that $p(d, f(e))$
is true. Notice that the choice of $d$ here may depend on $e$. Since $I$ is a model of
$\psi$, for every $e \in D$, we can choose the same $d \in D$ such that $p(d, f(e))$ is true
in $I$. Consider the Herbrand interpretation $I = \{p(t, \ldots) \mid t \text{ ground}\}$. Then $I$ is
a model of $\phi$. For an assignment $\sigma = \{x/t\}$, let $\gamma = \{y/t\}$; then $\phi(\sigma \cup \gamma)$ is true.
We cannot say $I$ is a model of $\psi$ because $\gamma$ depends on $\sigma$. We can generalize
this example to the following lemmas.

Lemma 1. Let $\phi = q_1 x_1 \ldots q_n x_n M(x_1, \ldots, x_n)$. An interpretation $I$ with
domain $D$ is a model of $\phi$ iff for every $\sigma : uVar(\phi) \rightarrow D$ there is a $\gamma : eVar(\phi) \rightarrow D$
such that the following two conditions are satisfied: (a) $M(\sigma \cup \gamma)$ is true in
$I$. (b) The definition of $\gamma$ on $x_i$ depends only on how $\sigma$ and $\gamma$ are defined on
$\{x_1, x_2, \ldots, x_{i-1}\}$, i.e., an element in $D$ can be assigned to $x_i$ after the
assignment of all $x_j$, $j < i$, has been done.
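
For a finite domain, the condition in Lemma 1 amounts to evaluating the quantifiers from left to right, so that a value for an existential variable is chosen only after all earlier variables have been assigned. The following Python sketch illustrates this under assumptions made here: the function `holds`, the encoding of a prenex as a list of (quantifier, variable) pairs, and the two-element interpretation are hypothetical, and the example uses a binary predicate p(x, y) without function symbols for brevity.

```python
def holds(prenex, matrix, domain, assignment=None):
    """Check a prenex formula in a finite interpretation, left to right."""
    assignment = assignment or {}
    if not prenex:
        return matrix(assignment)  # all variables assigned: test the matrix
    (quantifier, var), rest = prenex[0], prenex[1:]
    # The value chosen for var may depend only on the variables assigned so far.
    branches = (holds(rest, matrix, domain, {**assignment, var: d})
                for d in domain)
    return all(branches) if quantifier == "forall" else any(branches)

domain = [0, 1]
p = {(0, 0), (1, 1)}                       # p(x, y) is true iff x equals y
matrix = lambda a: (a["x"], a["y"]) in p

phi = holds([("forall", "y"), ("exists", "x")], matrix, domain)
psi = holds([("exists", "x"), ("forall", "y")], matrix, domain)
print(phi, psi)  # True False
```

In this interpretation the formula with prenex "forall y, exists x" is true because x may be chosen after y, while the one with prenex "exists x, forall y" is false because a single x would have to work for every y.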

It is possible to extend the next lemma to some other situations. We prove 
only the case which is needed for defining our refinement operator.

Lemma 2. Let

$\psi = q_1 x_1 \ldots \exists x_i \ldots \forall x_j \ldots q_n x_n M$, $\phi = q_1 x_1 \ldots \forall x_j \ldots \exists x_i \ldots q_n x_n M$
and suppose there is no other existential variable between $x_i$ and $x_j$ in the
prenexes of these two formulas. Then $\psi \models \phi$.

Proof. Suppose $I$ is a model of $\psi$. We want to prove that $I$ is also a model
of $\phi$. Let $\sigma : uVar(\psi) \rightarrow D$. By Lemma 1, there is a variable assignment $\gamma$ of
$eVar(\psi)$ such that $M(\psi)(\sigma \cup \gamma)$ is true under $I$. The assignment of $x_k \in eVar(\psi)$
may depend on the definition of $\sigma$ and $\gamma$ on variables before $x_k$ in $Q(\psi)$. Since
$uVar(\psi) = uVar(\phi)$ and $M(\psi) = M(\phi)$, we have $M(\phi)(\sigma \cup \gamma)$ true under $I$.
By Lemma 1 we can say $I$ is a model of $\phi$ if $\gamma$ on $x_k \in eVar(\phi)$ depends only on
the assignment of variables before $x_k$ in $Q(\phi)$. Note that $Q(\phi)$ interchanges only
the order of $\exists x_i$ and $\forall x_j$ in $Q(\psi)$. The assignment of $x_i$ in $\gamma$ depends only
on the assignment of $\sigma, \gamma$ on $x_1, \ldots, x_{i-1}$ which appear before $\exists x_i$ in $Q(\phi)$. If
$x_k \in eVar(\phi)$ is another existential variable, then it is not between $x_i$ and $x_j$ so
$x_1, \ldots, x_{k-1}$ are still before $x_k$ in $Q(\phi)$. That means $I$ is a model of $\phi$.

2.3 Adding Literals 

A classic refinement step for a universally quantified PCNF extends a clause in
the matrix with an extra literal containing new variables. This will also be done
for our refinement operator for PCNF. For this we need the following results
which are based on the fact that a disjunction of formulas is true if at least
one of them is true.

Lemma 3. Let $\psi = q_1 x_1 \ldots q_n x_n C_1 \wedge C_2 \ldots \wedge C_m$ where every $C_i$ is a clause.
Let $L$ be a literal which contains only variables $y_1, \ldots, y_k$ that are new w.r.t. $\psi$.
If $\phi = q_1 x_1 \ldots q_n x_n \forall y_1 \ldots \forall y_k\, C_1 \wedge \ldots \wedge (C_j \vee L) \ldots \wedge C_m$, then $\psi \models \phi$.

We call the action in Lemma 3 adding a u-literal. Similarly we can prove the
following lemma which adds an e-literal to a formula. 

Lemma 4. Let $\psi = q_1 x_1 \ldots q_n x_n C_1 \wedge C_2 \ldots \wedge C_m$ where every $C_i$ is a clause.
Let $L$ be a literal which contains only variables $y_1, \ldots, y_k$ that are new w.r.t.
$\psi$. If $\phi = \exists y_1 \ldots \exists y_k q_1 x_1 \ldots q_n x_n C_1 \wedge \ldots \wedge (C_j \vee L) \ldots \wedge C_m$, then $\psi \models \phi$.

3 Substitutions and Specializations

We are used to a substitution $\theta$ replacing some variables with terms. For a
universally quantified clause $C$ we always have $C \models C\theta$. This is not valid for
PCNF as the following examples show. Thus we are motivated to define a new
type of substitutions which specialize PCNF.

Example 2. Consider the following implications: 

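$\forall x\, p(x) \models p(a)$ and $p(a) \not\models \forall x\, p(x)$
$p(a) \models \exists x\, p(x)$ and $\exists x\, p(x) \not\models p(a)$

$\exists x\, p(x,x) \models \exists x \exists y\, p(x,y)$ and $\exists x \exists y\, p(x,y) \not\models \exists x\, p(x,x)$

A unification of two universally quantified variables does not always specialize
a PCNF. Let $\psi = \forall x \exists y \forall z\, p(x,y,z)$, $\phi = \forall x \exists y\, p(x,y,x)$, $\phi' = \exists y \forall z\, p(z,y,z)$, and
$I = \{p(t, \ldots) \mid \ldots \text{ ground}\}$. Then $I$ is a model of $\psi$ and $\phi$ but not a model

of $\phi'$. For $\phi'$ to be true in $I$, we need an $s$ such that $p(t,s,t)$ is true for every $t$.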



3.1 Elementary Substitutions for PCNF 

A matrix can be pictured as a tree, with the root on top. At each node, number 
downgoing branches 1, 2, 3, etc. from left to right. Each node and the tree 
hanging from it is given by the path that leads to it from the top. For example, let 
$M = p(x,y) \wedge (p(x,x) \vee \neg q(f(x)))$. The second clause has position (2). $\neg q(f(x))$
has position (2,2), $f(x)$ has position (2,2,1), etc.

Definition 4. A substitution for a matrix $M$ has the form $\theta = \{(t_1/s_1, p_1), \ldots,
(t_n/s_n, p_n)\}$. $M\theta$ is a matrix formed by using $M$ and $\theta$: for every $i$, the term
at position $p_i$ in $M$ is $t_i$ and $t_i$ should be replaced by $s_i$. For example, if $M =
p(x,y) \wedge (p(x,x) \vee \neg q(f(x)))$ and $\theta = \{(x/f(z), (1,1)), (f(x)/g(z), (2,2,1))\}$, then
$M\theta = p(f(z),y) \wedge (p(x,x) \vee \neg q(g(z)))$. It is easy to see that the old definition of
a substitution is a special case of the new kind of substitutions. In such a case
we use the old notation where the positions are not needed.
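
The example in Definition 4 can be reproduced with a small Python sketch. The encoding is an assumption made here for illustration: a tree node is a tuple whose first entry is a label and whose remaining entries are its children, a negated literal is a single node labelled "not q", and the helper names (`node`, `subterm`, `replace_at`, `apply_substitution`) are hypothetical. The sketch also assumes that no position in the substitution lies inside another one.

```python
def node(label, *children):
    """A tree node is a tuple: the label first, then the child nodes."""
    return (label,) + children

def subterm(tree, position):
    """Follow branch numbers (1-based, left to right) down from the root."""
    for branch in position:
        tree = tree[branch]  # index 0 holds the label, so child k is at index k
    return tree

def replace_at(tree, position, new):
    """Return a copy of tree with the subtree at position replaced by new."""
    if not position:
        return new
    k = position[0]
    return tree[:k] + (replace_at(tree[k], position[1:], new),) + tree[k + 1:]

def apply_substitution(matrix, theta):
    """theta is a list of (t, s, position): t at that position becomes s."""
    result = matrix
    for t, s, position in theta:
        if subterm(matrix, position) != t:
            raise ValueError("term at position %r is not %r" % (position, t))
        result = replace_at(result, position, s)
    return result

x, y, z = node("x"), node("y"), node("z")
# M = p(x,y) AND (p(x,x) OR NOT q(f(x)))
M = node("and",
         node("p", x, y),
         node("or", node("p", x, x), node("not q", node("f", x))))
theta = [(x, node("f", z), (1, 1)),
         (node("f", x), node("g", z), (2, 2, 1))]
M_theta = apply_substitution(M, theta)
```

Here `M_theta` encodes p(f(z),y) AND (p(x,x) OR NOT q(g(z))): only the two listed occurrences change, and the other occurrences of x are untouched.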

Definition 5. Let $\psi = q_1 x_1 \ldots q_n x_n C_1 \wedge \ldots \wedge C_m = Q(\psi)M(\psi)$ be a PCNF.
There are the following 5 types of elementary substitutions for $\psi$:

The first two types have to do with universal variables:

- Let $x_i, x_j \in uVar(\psi)$ and $i < j$. An elementary u-unification $\theta = \{x_j/x_i\}$
can be applied to $\psi$ such that $\psi\theta = q_1 x_1 \ldots q_n x_n (M(\psi)\theta)$. For example let
$\psi = \forall x \forall y\, p(f(x),y)$ and $\theta = \{y/x\}$. Then $\psi\theta = \forall x \forall y\, p(f(x),x)$ which is
equivalent to $\forall x\, p(f(x),x)$.

- Let $x_i \in uVar(\psi)$. If $t = f(y_1, \ldots, y_k)$, where $y_1, \ldots, y_k$ are new distinct
variables w.r.t. $\psi$, then $\theta = \{x_i/t\}$ is called an elementary u-substitution.
Let $\psi\theta$ be the new formula constructed as follows. All $x_i$-occurrences in the
matrix of $\psi$ are replaced by $t$ simultaneously, i.e. $M(\psi\theta) = M(\psi)\theta$. Moreover,
the $\forall x_i$ in the prenex of $\psi$ is replaced by $\forall y_1 \forall y_2 \ldots \forall y_k$. For example,
let $\psi = \forall x \exists y\, p(x,y)$ and $\theta = \{x/f(u,v)\}$. Then $\psi\theta = \forall u \forall v \exists y\, p(f(u,v),y)$.

The next two types have to do with existential variables:

- Let $x_i \in eVar(\psi)$ and let $\{(x_i, p_1), \ldots, (x_i, p_k)\}$ be a proper subset of the
$x_i$-occurrences in $M(\psi)$. If $z$ is a new variable, then $\theta = \{(x_i/z, p_1), \ldots,
(x_i/z, p_k)\}$ is called an elementary e-antiunification and $\psi\theta$ is the PCNF
$q_1 x_1 \ldots \exists x_i \exists z\, q_{i+1} x_{i+1} \ldots q_n x_n (M\theta)$. For example, let $\psi = \exists x\, p(x,x)$ and
$\theta = \{(x/z, (2))\}$. Then $\psi\theta = \exists x \exists z\, p(x,z)$.

- Let $t = f(x_{i_1}, \ldots, x_{i_m})$ which contains only distinct existential variables. Let
$i_j = \max\{i_1, \ldots, i_m\}$. Let $\{(t, p_1), \ldots, (t, p_k)\}$ be occurrences in $M(\psi)$. If $z$
is a new variable, then $\theta = \{(t/z, p_1), \ldots, (t/z, p_k)\}$ is called an elementary
e-substitution for $\psi$. We define $\psi\theta = q_1 x_1 \ldots q_{i_j} x_{i_j} \exists z\, q_{i_j+1} x_{i_j+1} \ldots q_n x_n (M\theta)$.
For example, let $\psi = \forall x \exists y \exists u \exists v\; p(x,u) \wedge (p(x,y) \vee \neg q(f(u,v)))$. If $\theta =
\{(f(u,v)/z, (2,2,1))\}$, then $\psi\theta = \forall x \exists y \exists u \exists v \exists z\; p(x,u) \wedge (p(x,y) \vee \neg q(z))$.

The last type has to do with interchanging the places of an existential and a
universal variable in $Q(\psi)$:

- Suppose $x_i \in eVar(\psi)$, $x_j \in uVar(\psi)$ and $i < j$. If there is no other
existential variable between $x_i$ and $x_j$ in $Q(\psi)$, then $\{(x_i, x_j)\}$ denotes an
elementary eu-substitution. It interchanges the positions of $x_i$ and $x_j$ in $Q(\psi)$. For
example, let $\theta = \{(x,y)\}$. Then $(\exists x \forall y\, p(x,y))\theta = \forall y \exists x\, p(x,y)$.
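
Because an elementary eu-substitution acts only on the prenex, it is easy to sketch in Python. The encoding of a prenex as a list of (quantifier, variable) pairs and the function name `eu_substitution` are assumptions made for this illustration; the side conditions checked are the ones stated in the definition above.

```python
def eu_substitution(prenex, x, y):
    """Swap an existential x with a later universal y in the prenex."""
    names = [var for _, var in prenex]
    i, j = names.index(x), names.index(y)
    if prenex[i][0] != "exists" or prenex[j][0] != "forall" or i >= j:
        raise ValueError("need an existential x placed before a universal y")
    if any(q == "exists" for q, _ in prenex[i + 1:j]):
        raise ValueError("another existential variable lies between x and y")
    swapped = list(prenex)
    swapped[i], swapped[j] = prenex[j], prenex[i]  # the matrix is left unchanged
    return swapped

print(eu_substitution([("exists", "x"), ("forall", "y")], "x", "y"))
# [('forall', 'y'), ('exists', 'x')]
```

The second check matters: with another existential variable between the two, that variable would end up before a universal variable it previously followed, which is not covered by the definition.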

3.2 Substitutions and Specializations 

For every elementary substitution $\theta$ w.r.t. $\psi$ we can prove that $\psi \models \psi\theta$. For
instance,

Lemma 5. Let $\psi = Q(\psi)M(\psi)$ be a PCNF. Let $\theta = \{(t/z, p_1), \ldots, (t/z, p_m)\}$
be an elementary e-substitution. Then $\psi \models \psi\theta$.

Proof. Let $I$ be a model of $\psi$. Let $t$ contain only the existential variables
$x_{i_1}, \ldots, x_{i_k}$ where $i_1 < i_2 < \ldots < i_k$. Then $\psi\theta = q_1 x_1 \ldots q_{i_k} x_{i_k} \exists z\, q_{i_k+1} x_{i_k+1}
\ldots q_n x_n M\theta$. $M(\psi)$ differs from $M(\psi\theta)$ only at positions $p_1, p_2, \ldots, p_m$. Clearly,
$uVar(\psi\theta) = uVar(\psi)$ and $eVar(\psi\theta) = eVar(\psi) \cup \{z\}$. Let $\sigma$ be an assignment of
$uVar(\psi\theta)$. Since $I$ is a model of $\psi$, and the assignment $\sigma$ is also an assignment of
$uVar(\psi)$, by Lemma 1 there is a variable assignment $\gamma$ of $eVar(\psi)$ to $D$ such that
$M(\psi)(\sigma \cup \gamma)$ is true under $I$. Moreover, if $x_i \in eVar(\psi)$, then $\gamma(x_i)$ depends only
on how $\sigma$ and $\gamma$ behave on variables before $x_i$ in $Q(\psi)$. Notice that at every
position $p_j$ in $M(\psi)(\sigma \cup \gamma)$ we have in fact $t\gamma$. We consider the substitution
$\gamma' = \gamma \cup \{z/t\gamma\}$. Then $M(\psi)(\sigma \cup \gamma) = M(\psi)\theta(\sigma \cup \gamma')$. Notice $z$ is behind all
variables $x_{i_1}, \ldots, x_{i_k}$ in $Q(\psi\theta)$. Thus $z$ also depends only on how $\sigma$ and $\gamma'$ are
defined on the variables before it. By applying Lemma 1 we have that $I$ is a model of $\psi\theta$.

We can combine the elementary substitutions to get the following definition 
and theorem. 

Definition 6. Let $\psi$ be a PCNF. Suppose $\theta_1$ is an elementary substitution w.r.t.
$\psi$ and $\theta_i$ is elementary w.r.t. $(\ldots(\psi\theta_1)\theta_2 \ldots)\theta_{i-1}$ for every $i = 2, \ldots, n$. Let
$\theta = \theta_1 \ldots \theta_n$ be the composition of these $\theta_i$, i.e. $\psi\theta = (\ldots((\psi\theta_1)\theta_2) \ldots)\theta_n$. Then
$\theta$ is called a substitution w.r.t. $\psi$.

Theorem 2. For a PCNF $\psi$ and a substitution $\theta$ defined as above, $\psi \models \psi\theta$.

4 Downward Refinement Operator

Let the search space $S$ be the set of all PCNF of a first order logical language
which has a finite number of function and predicate symbols. Let the top $\top \in S$
be the conjunction of a positive number of empty clauses, i.e. $\top = \Box \wedge \Box \ldots \wedge
\Box$. The number of $\Box$ is irrelevant because they are all equivalent to false. A
refinement operator on $S$ is a function $\rho : S \rightarrow 2^S$ (the set of all subsets of
$S$). A refinement operator $\rho$ is downward if for every $\psi$ and $\phi \in \rho(\psi)$, we have
$\psi \models \phi$. A refinement chain from $\psi$ to $\phi$ is a sequence $\psi_0, \ldots, \psi_k$ in $S$ such
that $\psi_0 = \psi$, $\psi_k \Leftrightarrow \phi$ and $\psi_i \in \rho(\psi_{i-1})$.

Definition 7. Let $\psi = q_1 x_1 \ldots q_n x_n C_1 \wedge C_2 \ldots \wedge C_m$. Let $\rho$ be a refinement
operator on $S$ defined by the following items.

Note that there are only a finite number of non-alphabetical variants in $\rho(\psi)$.

The first three items have to do with universal variables:

- 1. For an elementary u-unification $\theta = \{y/x\}$, where $x, y \in uVar(\psi)$ and $x$
comes before $y$ in $Q(\psi)$, let $\psi\theta \in \rho(\psi)$.

- 2. For $x \in uVar(\psi)$ and an elementary u-substitution $\theta = \{x/f(y_1, \ldots, y_k)\}$,
let $\psi\theta \in \rho(\psi)$.

- 3. Let $L = p(y_1, \ldots, y_k)$ or $\neg p(y_1, \ldots, y_k)$ where $y_1, \ldots, y_k$ are distinct
variables new w.r.t. $\psi$. For an arbitrary $j = 1, \ldots, m$, let $\phi_j$ be defined by adding
a u-literal to $\psi$: $\phi_j = q_1 x_1 \ldots q_n x_n \forall y_1 \ldots \forall y_k\, C_1 \wedge \ldots \wedge (C_j \vee L) \wedge \ldots \wedge C_m$.
Let $\phi_j \in \rho(\psi)$.

The next three items have to do with existential variables:

- 4. Let $x \in eVar(\psi)$ and $\{(x, p_1), \ldots, (x, p_k)\}$ be some (not all) $x$-occurrences
in $M(\psi)$. For a new variable $y$ and an elementary e-antiunification $\theta =
\{(x/y, p_1), \ldots, (x/y, p_k)\}$, let $\psi\theta \in \rho(\psi)$.

- 5. Let $(t, p_i) = (f(y_1, \ldots, y_s), p_i)$, $i = 1, \ldots, k$ be term occurrences in $M(\psi)$
such that all $y_j$ are distinct and in $eVar(\psi)$. Then for the elementary
e-substitution $\theta = \{(t/y, p_1), \ldots, (t/y, p_k)\}$, let $\psi\theta \in \rho(\psi)$.

- 6. Let $L = p(y, \ldots, y)$ or $L = \neg p(y, \ldots, y)$ be a literal with new variable $y$.
Then for $j = 1, \ldots, m$, let $\phi_j \in \rho(\psi)$ be defined by adding an e-literal, i.e.
$\phi_j = \exists y\, q_1 x_1 \ldots q_n x_n C_1 \wedge \ldots \wedge (C_j \vee L) \ldots \wedge C_m$.

The last item has to do with eu-substitutions:

- 7. Let $x \in eVar(\psi)$ and $y \in uVar(\psi)$ and suppose $x$ comes before $y$ in $Q(\psi)$.
For the elementary eu-substitution $\theta = \{(x,y)\}$, let $\psi\theta \in \rho(\psi)$.



Theorem 3. If $\phi \in \rho(\psi)$, then $\psi \models \phi$, i.e. $\rho$ is a downward refinement operator.

5 The Completeness of the Refinement Operator $\rho$

In this section we will show that the refinement operator $\rho$ is weakly complete
in $S$. That is to say, for every $\phi$ in $S$, there is a finite refinement chain from $\top$
to $\phi$. We show this by the following steps. Example 3 will illustrate these steps
more concretely.

1. Replace every existential variable in $\phi$ by a new constant not in $\phi$ and
remove all the existential variables from $Q(\phi)$. Let the new PCNF be $\psi$. Then
the variables in $\psi$ are universally quantified. Let $M(\psi) = C_1 \wedge C_2 \ldots \wedge C_m$.
We then have that $\Box$ subsumes $C_i$ for all $i$.

2. Similar to a result about the classic refinement operator
[NW97], we can prove that there is a chain from $\Box$ to every $C_i$. The
combination of these chains will give a chain from $\top$ to $\psi$.

3. By using the elementary e-substitutions (item 5 of $\rho$) we can change the
constant occurrences in $\psi$ back to existential variables. This establishes a
refinement chain from $\psi$ to $\phi'$ which looks almost like $\phi$ but all existential
variables appear before the universal variables.

4. Using eu-substitutions (item 7 of $\rho$) we can move the existential variables
to the right places in the prenex. This means there is a chain from $\phi'$ to $\phi$.
Thus we have the weak completeness of $\rho$.

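Example 3. We will give an example to show what a concrete finite chain from
$\top$ to a given $\phi = \forall x \exists y\,((\neg p(x) \vee p(f(x)) \vee q(y)) \wedge r(y,a))$ looks like. Note that
the chain is not unique. Such a chain exists for a general $\phi$ (Theorem 4) because
of the following lemmas. We use an arrow $\xrightarrow{n}$ to denote a refinement step which
uses the n-th item of $\rho$.

$\forall x \forall u \forall v \forall w \forall w'\,((\neg p(x) \vee p(u) \vee q(v)) \wedge r(w,w')) \xrightarrow{2}$

$\forall x \forall u_1 \forall v \forall w \forall w'\,((\neg p(x) \vee p(f(u_1)) \vee q(v)) \wedge r(w,w'))$

$\forall x \forall v \forall w'\,((\neg p(x) \vee p(f(x)) \vee q(v)) \wedge r(v,w'))$

$\psi = \forall x\,((\neg p(x) \vee p(f(x)) \vee q(b)) \wedge r(b,a))$

$\phi' = \exists y \forall x\,((\neg p(x) \vee p(f(x)) \vee q(y)) \wedge r(y,a))$

$\phi = \forall x \exists y\,((\neg p(x) \vee p(f(x)) \vee q(y)) \wedge r(y,a))$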

Lemma 6. Let $C$ be a universally quantified clause. Then there is a finite chain
of refinements from $\Box$ to $C$.

Lemma 7. Let $\psi = Q(\psi)M(\psi)$ be a universally quantified PCNF. Then there
is a finite $\rho$-chain from $\top$ to $\psi$.

Lemma 8. Let $\phi = \exists y_1 \ldots \exists y_n \forall x_1 \ldots \forall x_m M(\phi)$ whose existential variables in
the prenex appear before the universal variables. Let $b_1, \ldots, b_n$ be different
constants which do not appear in $\phi$. Let the universally quantified $\psi$ be $\phi$ after
replacing variable $y_i$ by $b_i$. Then there is a finite $\rho$-chain from $\psi$ to $\phi$.

Theorem 4. Given $\phi = Q(\phi)M(\phi)$, there is a finite $\rho$-chain from $\top$ to $\phi$.

6 Learning PCNF in Practice

Based on the refinement operator given in the previous sections, we have
extended a simple version of the Claudien system to a learning system
PCL (abbreviation of PCNF Claudien), and implemented it in Prolog. These
two systems use learning by interpretations. They actually learn a set of
regularities that are true in the interpretations. For instance, the integrity
constraint about the suppliers in Section 1 can be considered as a regularity
in the database although these clauses of regularities do not imply any ground
atom (see Introduction for examples which are presented as ground atoms).

Claudien learns a universal clausal theory w.r.t. a set of positive examples 
(interpretations), such that each example is a model of the theory. This clausal 
theory can be considered as a set of regularities (or integrity constraints) satisfied 
by these examples. PCL is able to learn a PCNF which is more expressive. 

Consider a finite set of scenes (Bongard interpretations) as positive examples.
A scene contains several figures: each figure has properties like shape, size, ... and
these figures are related to each other as indicated by in, above, left_of, .... Claudien
is able to find things like $\forall x \forall y(shape(y, circle) \leftarrow figure(x), figure(y), in(x,y))$,
i.e., for every figure x and y in a scene, when x is inside y, then y must
be a circle. The following rule cannot be found by Claudien, but PCL can find it:
$\forall x \exists y((in(x,y) \leftarrow shape(x, triangle)) \land shape(y, circle))$, i.e., each triangle is in at
least one circle.

To extend Claudien towards PCL, several issues have to be addressed. In the 
following subsections we discuss the most important ones. 



6.1 Testing Interpretation in a Search Space 

In Claudien, a clause $head \leftarrow body$ is true for an example if the Prolog query
?- body, not(head) fails for that example. This works only if the clauses are
range restricted, meaning that all variables in the head should also occur in
the body. Indeed, consider a Herbrand interpretation $I = \{p(a,b), q(a)\}$ and
$\phi = \forall x \forall y(p(x,y) \leftarrow q(x))$. $\phi$ is false because $p(a,a) \leftarrow q(a)$ is false. We get No
as the answer of the query because there is a refutation of $false \leftarrow p(x,y)$ and
$p(x,y) \leftarrow q(x)$.
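The mismatch described above can be reproduced with a small sketch. It is not Claudien's code: the function names are hypothetical, and the Prolog query is only mimicked in Python under the assumption that not/1 behaves as negation as failure with the head variable y left unbound.

```python
from itertools import product

# Herbrand interpretation from the text: I = {p(a,b), q(a)}
I = {("p", "a", "b"), ("q", "a")}
DOMAIN = ["a", "b"]  # assumption: the constants occurring in I


def naf_query_fails(interp, domain):
    """Mimic ?- q(X), not(p(X, Y)) with Y unbound.

    For a given X, not(p(X, Y)) fails whenever some p(X, _) fact exists,
    so a single matching fact is enough to hide a violation.
    """
    for x in domain:
        if ("q", x) in interp:
            p_has_solution = any(("p", x, y) in interp for y in domain)
            if not p_has_solution:
                return False  # the query succeeds: a violation is reported
    return True  # the query fails: the clause is accepted


def holds_universally(interp, domain):
    """Check forall x, y: p(x, y) <- q(x) by enumerating every binding."""
    return all(
        ("p", x, y) in interp or ("q", x) not in interp
        for x, y in product(domain, repeat=2)
    )
```

On this interpretation the mimicked query fails, so the clause would be accepted, while the exhaustive check finds the counterexample x = a, y = a. This is the gap that range restriction closes.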

For PCL, we also consider range restricted PCNFs. The following definition
can be found in [?]: a PCNF $\phi$ in $S$ is range restricted iff

— If $x \in uVar(\phi)$ and $x$ is in a positive literal of a clause $C$ (in $head(C)$) in
$M(\phi)$, then $x$ must also appear in a negative literal of $C$ (in $body(C)$).

— If $x \in eVar(\phi)$ and $x$ is in a negative literal of a clause $C$ (in $body(C)$) in
$M(\phi)$, then there is a clause $D$ in $M(\phi)$ with only positive literals (no body)
such that $x$ appears in every literal in $D$ (in every atom of $head(D)$).

Intuitively, a range restricted formula is structured in such a way that for
an interpretation, the range of a variable is restricted to the elements defined
by some other relations in the same formula. For example, $\exists x(false \leftarrow p(x))$ is
not range restricted. To test this kind of rule, we need an explicit domain for the
variable $x$. Let $\varphi = \forall x \exists y(r(y) \land (q(x) \leftarrow p(x,y)))$ and $I = \{q(a), r(a), p(b,a)\}$.
Then $\varphi$ is range restricted. To check if $I$ is a model of $\varphi$, we need only to consider
$y$ where $r(y)$ is true, i.e. $\{y/a\}$.
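A minimal sketch of this restricted test follows. It assumes the clause is read as q(x) <- p(x,y) and that x ranges over the constants occurring in I. The helper names are illustrative and not part of PCL.

```python
# Interpretation from the text: I = {q(a), r(a), p(b,a)}
I = {("q", "a"), ("r", "a"), ("p", "b", "a")}
# assumption: the domain is the set of constants occurring in I
DOMAIN = sorted({term for fact in I for term in fact[1:]})


def candidates_for_y(interp):
    """Range restriction: y only needs to range over the extension of r."""
    return [fact[1] for fact in interp if fact[0] == "r"]


def clause_holds(interp, x, y):
    """Truth of the clause q(x) <- p(x, y) for one binding."""
    return ("q", x) in interp or ("p", x, y) not in interp


def is_model(interp, domain):
    """forall x exists y (r(y) and (q(x) <- p(x, y))), with y taken from r."""
    ys = candidates_for_y(interp)
    return all(any(clause_holds(interp, x, y) for y in ys) for x in domain)
```

Under these assumptions the only candidate is y = a, and the check fails for x = b, because p(b,a) holds while q(b) does not. The point of the sketch is that no binding of y outside the extension of r has to be tried.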

A search space $S$ for PCL consists of range restricted, function free PCNFs
(the only function symbols are constants). Moreover, in a PCNF there are upper
bounds on the number of clauses in a matrix, literals in each clause, constants
and variables, hence $S$ is a finite set. The search space is further restricted by
some language bias. For instance, the types of the arguments of each predicate
must be declared. Also, a declaration is necessary for each type for which
constants must be generated. This allows one, for example, to specify some types
for which no constants should be generated.

6.2 The Downward Refinement Operator 

For the implementation of the refinement operator p, several issues must be 
addressed. 

First, for efficiency it is necessary to optimize the refinement operator. We
should try to avoid deriving equivalent PCNFs. One way to do this is to define
an order in which literals occur in each clause and an order of the clauses. This
removes equivalences obtained by applying associativity and commutativity
rules. For example, we could obtain $false \leftarrow p \land q$ in two ways. We can start
from the empty clause $false$, first add $p$ to obtain $false \leftarrow p$ and then add $q$ to
obtain $false \leftarrow p \land q$. We could also first add $q$ to obtain $false \leftarrow q$ and then add
$p$ to obtain $false \leftarrow q \land p$, but the latter is not allowed because $p$ comes
alphabetically before $q$. Such orders are also considered in Claudien using the
DLAB language bias.
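The ordering idea can be sketched as follows. This is an illustration only, not the PCL or DLAB implementation: clause bodies are represented as tuples of literal names (an assumption), and a literal may be added only if it sorts after the last literal already present.

```python
def refine_body(body, literals):
    """Return the bodies obtained by adding one literal to `body`.

    A literal may only be added if it comes after the last literal
    already present, so each set of literals is reached along one path.
    """
    last = body[-1] if body else ""
    return [body + (lit,) for lit in sorted(literals) if lit > last]


def enumerate_bodies(literals, max_len):
    """Breadth-first enumeration of all bodies with at most max_len literals."""
    frontier, seen = [()], []
    while frontier:
        body = frontier.pop(0)
        seen.append(body)
        if len(body) < max_len:
            frontier.extend(refine_body(body, literals))
    return seen
```

With the literals p and q and at most two literals per body, the enumeration visits the empty body, (p), (q) and (p, q). The body (q, p) is never generated, which matches the example in the text.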

Second, we need to address the following problem: for every $i$-th clause in a
matrix with $n$ ($\geq i$) clauses, we can start from the conjunction of $i$ empty clauses.
Thus we have to search $n$ trees. The computational cost of searching in the $i$-th
tree grows very fast as $i$ increases. We can solve this problem partially by reusing
the formulas we have found. If $q_1 x_1 \ldots q_k x_k\, C_1 \land \ldots \land C_{i-1} \land C_i$ is a PCNF with $i$
clauses which is correct w.r.t. the examples, then $q_1 x_1 \ldots q_k x_k\, C_1 \land \ldots \land C_{i-1}$ must
also be a correct PCNF. To find the correct PCNFs with $i$ clauses, we can start
from a correct PCNF containing $i - 1$ clauses with an extra empty clause added.
The refinement operator can be applied again and again until good PCNFs are
found.

7 Experiments 

In this section we present some experiments which illustrate PCL and show that
PCL can learn rules which cannot be learned by existing systems that only
learn single clauses.

Experiment 1 We considered a set of undirected graphs as positive examples.
A possible example:

r(a,b). r(b,a). r(a,c). r(c,a).
r(a,d). r(d,a). r(e,a). r(a,e).

Every graph had the property that each point was connected to at least one
other point. PCL found this and also some other rules:

$\forall x \forall y(r(x,y) \leftarrow r(y,x))$, $\forall x \exists y(r(x,y) \leftarrow point(x))$

Note that the last PCNF is made range restricted by adding a predicate point. 
PCL automatically makes PCNF formulas range restricted by using the language 
bias and the definitions of the domain predicates. 

Experiment 2 We also did an experiment on some Bongard-like examples
mentioned at the beginning of the last section. These are scenes of figures (in this
case triangles and circles) which are related to one another. One example is

figure(a). figure(b). figure(c). figure(d).
in(a,b). in(c,b).
shape(a,triangle). shape(c,triangle). shape(b,circle). shape(d,circle).

The search space is large and many correct formulas are possible. This means
that, even more than in the case of Claudien, many trivial rules are found which
do not give much new knowledge:

$\exists x(shape(x, triangle))$, $\exists x \exists y(in(x,y))$,

$\exists y(figure(y) \land (false \leftarrow shape(y, triangle)))$.

After the search continued for some time, more interesting results were given,
such as:

$\forall x \exists y(shape(y, circle) \land (in(x,y) \leftarrow shape(x, triangle)))$.
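As a quick check, the scene from Experiment 2 can be tested against the Claudien-style rule of Section 6 and against the PCNF above. The sketch below is not PCL: the data structures and function names are assumptions, and the quantifiers are evaluated by plain enumeration over the figures of the scene.

```python
# The example scene, transcribed from the facts listed in Experiment 2.
figures = {"a", "b", "c", "d"}
inside = {("a", "b"), ("c", "b")}
shape = {"a": "triangle", "c": "triangle", "b": "circle", "d": "circle"}


def claudien_rule(figs, inside, shape):
    """forall x, y: shape(y, circle) <- figure(x), figure(y), in(x, y)."""
    return all(
        shape[y] == "circle" for (x, y) in inside if x in figs and y in figs
    )


def pcl_rule(figs, inside, shape):
    """forall x exists y: shape(y, circle) and (in(x, y) <- shape(x, triangle))."""
    circles = [y for y in figs if shape[y] == "circle"]
    return all(
        any(shape[x] != "triangle" or (x, y) in inside for y in circles)
        for x in figs
    )
```

Both rules hold in this scene. Removing in(c,b) leaves the triangle c outside every circle, so the PCNF no longer holds while the single-clause rule is unaffected.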

8 Conclusion and Future Work 

As we know, every closed formula is equivalent to a PCNF but not necessa- 
rily its Skolem standard form. Until now we consider in ILP almost exclusively 
conjunctions of universally quantified clauses, especially Horn clauses. To add 
expressiveness we should consider PCNF in general. 

If we want to extend refinement operators from sets of clauses to sets of
PCNF, we should first extend substitutions to PCNF. In this article we have
defined the substitutions which specialize a given PCNF. Elementary
substitutions and adding literals can be used to define a refinement operator $\rho$
which is weakly complete. Notice that we have not used items 5 and 6 in $\rho$ for
weak completeness. In a set of formulas ordered by some kind of generalization,
a refinement operator is complete if there is a refinement chain from $\psi$ to $\varphi$
whenever $\psi$ is more general than $\varphi$. For example, item 5 is needed when we
consider $\psi = \exists x\, p(x,x)$ and $\varphi = \exists x \exists y\, p(x,y)$. We would like to know more about
the search spaces where $\rho$ is complete and the relation between items 5 and 6 and
completeness.

This article not only lays a theoretical foundation for PCNF learning systems
but also demonstrates a simple system PCL which is already implemented. This
system deals with a finite search space of function free and range restricted
PCNF.

Abstract. In this paper we examine the problem of repairing incomplete
background knowledge using Theory Recovery. Repeat Learning under
ILP considers the problem of updating background knowledge in order to
progressively increase the performance of an ILP algorithm as it tackles
a sequence of related learning problems. Theory recovery is suggested as
a suitable mechanism. A bound is derived for the performance of theory
recovery in terms of the information content of the missing predicate
definitions. Experiments are described that use the logical back-propagation
ability of Progol 5.0 to perform theory recovery. The experimental results
are consistent with the derived bound.



1 Introduction 

In a previous paper the authors described an extension of the standard
machine learning framework, called repeat learning. In this framework, the learner
is not trying to learn a single concept, but a series of related concepts, all drawn
independently from the same distribution $\mathcal{D}$. A finite sequence of examples is
provided for each concept in the series. The learner does not initially know $\mathcal{D}$,
but progressively updates a posterior estimation of $\mathcal{D}$ as the series progresses.

Under Inductive Logic Programming (ILP) the learner's estimation of $\mathcal{D}$
depends on the linguistic bias conveyed by its hypothesis language. The ILP
learner can therefore alter the estimation of $\mathcal{D}$ by making changes to the
hypothesis language. The previous paper discussed a mechanism for this process
that adjusted the background knowledge using predicate invention.

One can quantify the expected performance of an ILP algorithm by bounding
the expected error of a hypothesis formed given the number of examples seen.
Previous bounds for Progol [?] have only considered the situation in which the
learner knows the distribution $\mathcal{D}$. In this paper, we construct a bound for the
case when the learner's estimate is incorrect. Significantly, this bound describes
the difference between the estimate and the true distribution $\mathcal{D}$ in terms of the
missing information content in the hypothesis language used by the learner.

Theory recovery is the process of adjusting, or completing, background
knowledge. In theory recovery, an incomplete logic program is reconstructed on the
basis of examples. The examples indicate the desired behaviour of a particular
predicate in the program, defined in terms of the background knowledge. A new
version of the ILP algorithm Progol, version 5.0, uses logical back-propagation to
perform theory recovery.

The paper is structured as follows. Section 2 describes the formulation of
the new expected error bound, with its proof given in Appendix C. Section
3 describes experiments that give results that are consistent with this bound.
A complete set of results is given in Appendix A. Finally, our conclusions are
drawn in Section 4. Appendix B briefly describes the mechanics of logical
back-propagation in Progol 5.0.



2 A Theory Recovery Error Bound 

It was shown in [?] that under suitable assumptions the class of all polynomial
time-bounded logic programs is (U-)learnable. In [?] explicit upper bounds were
given for the error of a Progol-like learning algorithm. The paper considered
the case of positive-only learning compared to the more traditional positive and
negative setting. In both cases upper bounds on expected error were derived
showing that learning could be efficiently achieved. However, under both models
a strong assumption was made: that the learner knows the prior distribution
(used by the teacher) over hypotheses. Clearly there is no reason why this should
in general be true. In particular, in the following result we assume that the
background knowledge of the learner is missing some predicate(s) contained in
the target concept. In Subsection 2.1 we review the average case Bayesian model
of learning used in [?] to analyse the expected error of a Progol-like learner.
In Subsection 2.2 we modify the model by assuming that the learner is not in
command of the correct prior distribution and derive adjusted upper bounds.



2.1 Known Prior 

The following is a version of the U-learnability framework presented in [?] and
restated in [?].



The Model $X$ is taken to be a countable class of instances and $\mathcal{H} \subseteq 2^X$ to be
a countable class of concepts. $D_X$ and $D_\mathcal{H}$ are probability distributions over $X$
and $\mathcal{H}$ respectively. The teacher randomly chooses a target theory $T$ from $D_\mathcal{H}$,
then randomly and independently chooses a series of examples $E = \langle x_1, \ldots, x_m \rangle$
from $D_X$ and classifies them according to $T$.

Given $E$, $D_\mathcal{H}$ and $D_X$ a learner $L$ outputs a hypothesis $H \in \mathcal{H}$. The error of
the hypothesis is measured as $D_X(H \setminus T) + D_X(T \setminus H)$.

The hypotheses in $\mathcal{H}$ are assumed to be ordered according to decreasing prior
probability as $H_1, H_2, \ldots$. The distribution $D_\mathcal{H}(H_i) = \frac{a}{i^2}$ is assumed, where
$a = 6/\pi^2$ is a normalising constant. This is similar to the prior probability
assumptions used in Progol 4.1 [?]. This distribution is a smoothed version of
the universal distribution¹, which assigns equal probability to the $2^b$ hypotheses
describable in $b$ bits, where the sum of the probabilities of such hypotheses is
$2^{-b}$.
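The role of the normalising constant can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the tail bound a/n is the standard integral comparison for the sum of 1/i², stated here as an assumption rather than taken from this section.

```python
import math

A = 6 / math.pi ** 2  # normalising constant a = 6 / pi^2


def prior(i):
    """Prior probability a / i^2 of the i-th hypothesis in the ordering."""
    return A / i ** 2


def tail_mass_bound(n):
    """Upper bound a / n on the total prior mass of hypotheses with index > n."""
    return A / n


# Partial sum over the first 100000 hypotheses; it approaches 1 from below.
partial = sum(prior(i) for i in range(1, 100001))
```

The partial sum is within 1e-4 of 1, so the prior is properly normalised, and the mass left outside the first n hypotheses shrinks like 1/n.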



Expected Error The following theorem (stated and proved in [?]) gives an
upper bound on the expected error of an algorithm which learns by maximising
the Bayes' posterior probability over the initial $am$ hypotheses within the space.

Theorem 1. Let $X$ be a countable instance space and $\mathcal{H} \subseteq 2^X$ be a countable
hypothesis space containing at least all finite subsets of $X$. Let $D_\mathcal{H}$, $D_X$
be probability distributions over $\mathcal{H}$ and $X$. Assume that $\mathcal{H}$ has an ordering
$H_1, H_2, \ldots$ such that $D_\mathcal{H}(H_i) \geq D_\mathcal{H}(H_j)$ for all $j > i$. Let $D_\mathcal{H}(H_i) = \frac{a}{i^2}$ where
$\frac{1}{a} = \sum_{i=1}^{\infty} \frac{1}{i^2} = \pi^2/6$. Let $\mathcal{H}_n = \{H_i : H_i \in \mathcal{H} \text{ and } i \leq n\}$. $T$ is chosen randomly
from $D_\mathcal{H}$. Let $ex(x,H) = \langle x, v \rangle$ where $v = True$ if $x \in H$ and $v = False$
otherwise. Let $E = \langle ex(x_1,T), \ldots, ex(x_m,T) \rangle$ where each $x_i$ is chosen randomly and
independently from $D_X$. $H_E = \{x : \langle x, True \rangle \in E\}$. Hypothesis $H$ is said to be
consistent with $E$ if and only if $x_i \in H$ for each $\langle x_i, True \rangle$ in $E$ and $x_j \notin H$
for each $\langle x_j, False \rangle$ in $E$. Let $n = am$. $L$ is the following learning algorithm. If
there are no hypotheses $H \in \mathcal{H}_n$ consistent with $E$ then $L(E) = H_E$. Otherwise
$L(E) = H_n(E) = H$ only if $H \in \mathcal{H}_n$, $H$ consistent with $E$ and for all $H' \in \mathcal{H}_n$
consistent with $E$ it is the case that $D_\mathcal{H}(H) \geq D_\mathcal{H}(H')$. The error of a hypothesis
$H$ is defined as $Error(H,T) = D_X(T \setminus H) + D_X(H \setminus T)$. The expected error
of $L$ after $m$ examples, $EE(m)$, satisfies:

\[ EE(m) \leq \frac{1.51 + 2\ln m}{m} \tag{1} \]



2.2 Unknown Prior 

We consider an extension of the above model. Previously it was assumed that
the learner knew precisely the distribution $D_\mathcal{H}$ from which the target concepts
were drawn. Clearly there is no reason why this should hold in practical machine
learning situations. We now relax this assumption and consider what happens
to the expected error of learning when the learner does not know the exact
distribution $D_\mathcal{H}$. In particular we consider an incorrect prior over hypotheses
induced by incomplete background knowledge.

The Modified Model We assume the existence of a universal linguistic bias
generator $G$ that, given a target space $\mathcal{H}$, and an hypothesis language $B$ for it,
returns a probability distribution $D_\mathcal{H} = G(\mathcal{H}, B)$ over the target space. Occam's
razor can be taken as an example of such a generator, one that always gives a
distribution that assigns a higher probability to hypotheses that can be expressed
more simply in the hypothesis language.

¹ If we take the universal distribution $u$ as described above, then the
probability of the $2^b$-th hypothesis is $u(H_{2^b}) = 2^{-2b} = (2^b)^{-2}$. So $\frac{a}{i^2}$ is a smoothed and
renormalised version of $u$.

The teacher selects the target concept from $\mathcal{H}$ according to the distribution
$D_T$. The learner's imperfect hypothesis language $B_L$ induces an incorrect
probability distribution $D_L = G(\mathcal{H}, B_L)$. We assume the existence of some set of
predicate definitions $P$ such that $G(\mathcal{H}, (B_L \cup P)) = D_T$. In other words, there is a
set of "missing" predicates that, if added to the background knowledge of the
learning algorithm, would mean that the learner's induced distribution $D_L$ would
be the correct one. The hypotheses in $\mathcal{H}$ are ordered by the teacher according to
decreasing prior probability $D_T(H_i) = \frac{a}{i^2}$ as $H_1, H_2, \ldots, H_i, \ldots$. The learner
only has partial information about this ordering in that its prior is a corrupted
version of the teacher's distribution. In particular, there is a set of hypotheses
$\mathcal{H}_P \subseteq \mathcal{H}$ for which $H \in \mathcal{H}_P \Rightarrow D_T(H) > D_L(H)$. Let the information content
in bits (see, for example, [?]) of an hypothesis relative to a distribution $D$ be
given by $info_D(H) = -\log_2(D(H))$. The information content of $H \in \mathcal{H}_P$ under
the learner's distribution $D_L$ is more (in bits) than the information that would
be assigned under the teacher's distribution $D_T$.

Lemma 1. $H \in \mathcal{H}_P$ is given different indices $H_i$ and $H_j$ under the orderings
induced by $D_T$ and $D_L$. Assume that the learner's prior distribution satisfies
$D_L(H_j) = \frac{a}{j^2}$. If the information content of the "missing" predicates in $P$ is at
most $k$ bits, then for any hypothesis $H \in \mathcal{H}$ the indices $i$ and $j$ satisfy $j \leq 2^{k/2} i$.

Proof. If $H \in \mathcal{H}_P$, then

\begin{align*}
k &\geq info_{D_L}(H) - info_{D_T}(H) = -\log_2 D_L(H) + \log_2 D_T(H) \\
  &= -\log_2 \frac{a}{j^2} + \log_2 \frac{a}{i^2} = 2\log_2 j - 2\log_2 i = 2\log_2 \frac{j}{i}.
\end{align*}

Therefore $j \leq 2^{k/2} i$. Clearly, if $H \notin \mathcal{H}_P$, then $j \leq i \leq 2^{k/2} i$.
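The inequality of Lemma 1 can be illustrated with concrete numbers. The sketch below assumes both priors have the form a/index² and uses hypothetical helper names. It checks that the largest index j permitted by the lemma uses up at most k extra bits, and that the next index exceeds k.

```python
import math

A = 6 / math.pi ** 2  # normalising constant of the a / i^2 prior


def info_bits(prob):
    """Information content -log2(prob) of a hypothesis, in bits."""
    return -math.log2(prob)


def max_learner_index(i, k):
    """Largest learner index j allowed by Lemma 1: j <= 2^(k/2) * i."""
    return math.floor(2 ** (k / 2) * i)


i, k = 3, 4
j = max_learner_index(i, k)
extra_bits = info_bits(A / j ** 2) - info_bits(A / i ** 2)
```

For i = 3 and k = 4 the largest admissible index is j = 12, where the extra information is 2 log2(12/3) = 4 bits. At j = 13 the difference already exceeds k.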

Expected Error 

Theorem 2. Let $X$ be a countable instance space and $\mathcal{H} \subseteq 2^X$ be a countable
hypothesis space containing at least all finite subsets of $X$. Assume the existence
of a universal linguistic bias generator $G$ that, given a target space $\mathcal{H}$, and an
hypothesis language $B$ for it, returns a probability distribution $D_\mathcal{H} = G(\mathcal{H}, B)$
over the target space. Let $D_X$ be a probability distribution over $X$. Let $D_T$,
$D_L$ be probability distributions over $\mathcal{H}$, where $D_L = G(\mathcal{H}, B_L)$ is the probability
distribution induced by the learner's hypothesis language. Assume the existence
of some set of predicate definitions $P$ such that $G(\mathcal{H}, (B_L \cup P)) = D_T$. Let the
information content of the predicate definitions $P$ be at most $k$ bits. Assume
that $\mathcal{H}$ has an ordering $H_1, H_2, \ldots, H_i, \ldots$ such that $D_T(H_i) \geq D_T(H_{i+1})$ for
all $i$ and an ordering $H'_1, H'_2, \ldots, H'_j, \ldots$ such that $D_L(H'_j) \geq D_L(H'_{j+1})$ for
all $j$. Assume $D_T(H_i) = \frac{a}{i^2}$ and $D_L(H'_j) = \frac{a}{j^2}$ where $\frac{1}{a} = \sum_{i=1}^{\infty} \frac{1}{i^2} = \pi^2/6$.
Let $\mathcal{H}'_n = \{H'_i : H'_i \in \mathcal{H} \text{ and } i \leq n\}$. $T$ is chosen randomly from $D_T$. Let
$ex(x,H) = \langle x, v \rangle$ where $v = True$ if $x \in H$ and $v = False$ otherwise. Let $E =
\langle ex(x_1,T), \ldots, ex(x_m,T) \rangle$ where each $x_i$ is chosen randomly and independently
from $D_X$. $H_E = \{x : \langle x, True \rangle \in E\}$. Hypothesis $H$ is said to be consistent with $E$
if and only if $x_i \in H$ for each $\langle x_i, True \rangle$ in $E$ and $x_j \notin H$ for each $\langle x_j, False \rangle$ in
$E$. Let $n = am$. $L$ is the following learning algorithm. If there are no hypotheses
$H \in \mathcal{H}'_n$ consistent with $E$ then $L(E) = H_E$. Otherwise $L(E) = H'_n(E) = H$
only if $H \in \mathcal{H}'_n$, $H$ consistent with $E$ and for all $H' \in \mathcal{H}'_n$ consistent with $E$
it is the case that $D_L(H) \geq D_L(H')$. The error of an hypothesis $H$ is defined
as $Error(H,T) = D_X(T \setminus H) + D_X(H \setminus T)$. The expected error of $L$ after $m$
examples, $EE(m)$, satisfies:

\[ EE(m) \leq \frac{1.51 + 2\ln m + k\ln 2}{m} \tag{2} \]

Proof. Given in Appendix C.



3 Experiments 

To confirm the assumptions made in deriving the bound given in Equation (2),
the following experiments were devised and run. The experiments made use of
the logical back-propagation abilities of Progol 5.0. This ILP algorithm uses an
augmented version of inverse entailment that includes the completion of
background knowledge in the generalisation process. The mechanism is described in
Appendix B.

The aim of the experiments was to determine how the accuracy of the logic
program would be affected by having a percentage of the clauses of the program
removed, and then using Progol 5.0 to repair the program, given a varying
number of examples.

3.1 The Experimental Domain 

The experiments are conducted in an artificial domain, called the base-n-string
domain. The elements of the domain are strings in base $n$, where $n \in \{2, 3, 4, 5\}$,
of length up to a maximum value $l$.

The target program has a distinct predicate for each distinct length of string.
The predicate that defines strings of length $m$ is defined in terms of the predicate
that defines strings of length $m - 1$. This means that a missing clause definition
for a predicate defining strings of length $m$ will affect the definition of all strings
of length greater than $m$. Table 1.1 shows the program for the binary case, that
is, $n = 2$ where $l = 10$.

The maximum string length $l$ was determined by restricting the total number
of clauses in the logic program to be 20. For the cases $n = 2, 4, 5$, this gave values
$l = 10, 5, 4$ respectively. For the case $n = 3$, the value $l = 7$ was chosen, and then
one of the definitions for p7/1 was excluded from the program.
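The clause counts quoted above can be reproduced by generating the program text. This is a sketch under assumptions: the generator is not part of the original experiments, and the symbol names beyond zero and one are hypothetical.

```python
# Symbol names: "zero" and "one" follow Table 1.1; the rest are assumed.
DIGITS = ["zero", "one", "two", "three", "four"]


def base_n_program(n, max_len):
    """Return the clauses of the base-n-string program as Prolog text."""
    symbols = DIGITS[:n]
    # Facts for strings of length 1.
    clauses = [f"p1({s})." for s in symbols]
    # Each longer string wraps one more symbol around a shorter string.
    for length in range(2, max_len + 1):
        for s in symbols:
            clauses.append(f"p{length}({s}(A)) :- p{length - 1}(A).")
    return clauses
```

The program has n times l clauses, which gives 20 for (n, l) = (2, 10), (4, 5) and (5, 4), and 21 for (3, 7). This agrees with the statement that one definition of p7/1 was dropped in base 3.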

3.2 Method 

In order to be able to encapsulate the entire success set of the logic program,
a meta-predicate ss/1 was defined. For every predicate pred(X) defined, the
clause ss(pred(X)) :- pred(X). was added to the program. This meant that
calling ss/1 would then return all the ground facts provable in the original logic
program.

Table 1.1: The program for base n = 2, l = 10.

p1(zero).
p1(one).
p2(zero(A)) :- p1(A).
p2(one(A)) :- p1(A).
...
p10(zero(A)) :- p9(A).
p10(one(A)) :- p9(A).

Table 1.2: Information content (in bits, to 2 d.p.) of a single clause in
each base.

Base n    l     Clause information
2         10    7.64
3         7     7.20
4         5     6.64
5         4     6.32

In the learning sessions under Progol 5.0, ss/1 was the target to be learned, 
and the incomplete logic program is given as background knowledge for ss/1. 
The examples given were of the form ss(fact) where fact is a ground fact 
that should be provable by the original complete logic program. Logical back- 
propagation in Progol 5.0 uses these examples to complete any missing predicates 
in the background knowledge. Notice that only positive examples were used. 

A complete program can be used to generate the set of all base-n-strings up 
to a certain length. An incomplete or partially repaired program will generate 
only a subset of these strings. Therefore at each stage we were measuring the 
coverage accuracy of the program. 

Each run of the experiment was parameterised by two parameters: p, the 
percentage of the logic program that was deleted, and m, the number of examples 
seen by Progol 5.0 in order to reconstruct the missing predicates. 

A run proceeded as follows: 

— The original program (defined in the background knowledge) has p percent 
of its clauses deleted. 

— Measure the accuracy of the program with depleted background knowledge. 

— Generate m random examples of the success set of the complete program. 

— Run Progol 5.0 with the incomplete program as background knowledge using 
the generated examples. 

— Measure the accuracy of the repaired program. 
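The deletion and measurement steps of a run can be sketched without Progol. The code below is an illustration under assumptions: a clause is represented as a (length, symbol) pair, "accuracy" is taken to be the fraction of the complete success set that is still provable, and the Progol 5.0 repair step is omitted entirely.

```python
import random


def success_set_size(clauses, max_len):
    """Count ground facts p_L(t) provable from `clauses`.

    `clauses` is a set of (length, symbol) pairs; the pair (L, s) stands
    for p_L(s(A)) :- p_{L-1}(A), or for the fact p_1(s) when L == 1.
    """
    total, provable = 0, 1
    for length in range(1, max_len + 1):
        heads = sum(1 for (lvl, _) in clauses if lvl == length)
        provable *= heads  # number of provable strings of this length
        total += provable
    return total


def coverage_after_deletion(n, max_len, percent, rng):
    """Delete `percent` of the clauses at random and return the coverage."""
    full = {(lvl, s) for lvl in range(1, max_len + 1) for s in range(n)}
    n_deleted = round(len(full) * percent / 100)
    kept = set(rng.sample(sorted(full), len(full) - n_deleted))
    return success_set_size(kept, max_len) / success_set_size(full, max_len)
```

For example, `coverage_after_deletion(4, 5, 30, random.Random(0))` gives the coverage of one depleted base-4 program. Because a deleted clause for short strings removes every longer string built on it, coverage can drop much faster than the fraction of deleted clauses.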

Each run was repeated 10 times, for each of the possible combinations of
values of $p \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100\}$ and $m \in \{10, 20, 40, 80, 120, 160\}$,
for each of the original logic programs in base $n \in \{2, 3, 4, 5\}$.



3.3 Theoretical Results for the Domain 

The theoretical bound (2) requires that one estimates the size in bits of the
information content of the missing predicates.

In reconstructing a background clause in base $n$, Progol 5.0 has a choice of a
certain number, $l$, of predicate symbols for the head, and a choice of the same $l$
predicate symbols for the one atom in the body. It also has a choice of $n$ function
symbols to add to the string (one, two, etc.).

Fig. 1. Average predictive accuracy (10 runs), base n = 4. Experimental result for
the case n = 4, m = 80. Complete results are shown in Appendix A.

If it is assumed that all of the possible choices are assigned equal probability,
then the information content of a single clause of this kind is therefore $info(n) =
\log_2(n l^2)$. The values of this function for the different possible values of $n$ are
given in Table 1.2.
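The tabulated values can be recomputed directly, reading the formula as log2(n · l²), that is, l choices for the head predicate, l for the body predicate and n for the function symbol. The function name below is hypothetical.

```python
import math


def clause_information(n, l):
    """Bits needed to pick one clause: log2(n * l^2).

    Assumes l equally likely head predicates, l equally likely body
    predicates and n equally likely function symbols.
    """
    return math.log2(n * l * l)


# (base n) -> (maximum length l, value listed in Table 1.2)
TABLE_1_2 = {2: (10, 7.64), 3: (7, 7.20), 4: (5, 6.64), 5: (4, 6.32)}
```

Rounded to two decimal places, `clause_information(n, l)` matches every entry of the table, which supports this reading of the formula.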

The bound on the expected experimental error is then: 



\[ EE(m) \leq \frac{1.51 + 2\ln(m) + c \cdot info(n)\ln(2)}{m} \]



where c is the number of missing clauses. 
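For a given setting the bound is easy to evaluate. The sketch below assumes info(n) = log2(n · l²) with the values of l used in the experiments. The function name is illustrative.

```python
import math


def recovery_error_bound(m, n, l, c):
    """Expected-error bound for m examples and c missing clauses in base n."""
    info = math.log2(n * l * l)  # bits per missing clause (assumed form)
    return (1.51 + 2 * math.log(m) + c * info * math.log(2)) / m
```

With c = 0 the expression reduces to the known-prior bound (1.51 + 2 ln m)/m. Each missing clause adds info(n) · ln 2 / m to the bound, so the bound grows linearly in the number of deleted clauses and can exceed 1 when m is small.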



3.4 Results 

A typical result is shown in Figure 1. This graph shows the case that $n = 4$, $m =
80$. The lower curve is the accuracy of the incomplete logic program before
theory recovery. The upper curve is the accuracy of the logic program after
theory recovery using $m$ examples. The straight line is the theoretical bound.

As the graph shows, the bound fits the results well, and runs parallel to the
observed experimental accuracy. The results for other values of $n$ (the base), and
$m$ (the number of examples) are similar. Results for all the values of $n$ and $m$
are given in Appendix A.

The experiments are consistent with the theoretical bound, and the
calculation of the information content of clauses in this domain (see Table 1.2).

4 Conclusions

This paper has introduced a theoretical model for analysing learning when per- 
forming theory recovery. We derived an average case error bound for the error 
of a Progol-like learner in such a situation and showed that the bound held 
and was reasonably tight under experiment. The experiments used logical back- 
propagation, a feature of Progol 5.0, to perform theory recovery. 

This work is part of a wider programme to analyse multiple-task learning
within a relational (ILP) setting. In particular, we analyse an issue raised by the
repeat learning framework introduced in [?], that of learning under an incorrect
prior distribution. The experiments differ from those conducted in the previous
paper in that theory recovery, rather than predicate invention, is used to alter
background knowledge. However, the repeat learning framework does not specify
the particulars of how the linguistic bias of the learner is to be altered.
Although there is no multiple-task learning in this work (we are only ever learning
one concept), the link with repeat learning comes in the form of the updating
of background knowledge and hence the updating of the linguistic bias of the
learner. In both models the learner is missing predicates in the background
knowledge. The analysis could easily be extended to the case when one is learning
more than one task. This would be the natural direction in which to extend the
research.

B Progol 5.0: Theory Recovery by Logical Back-Propagation

The theory recovery mechanism that is applied to the experiments in the paper is
provided by logical back-propagation, a form of generalised inverse entailment.
The problem specification for ILP is: given background knowledge $B$ and
examples $E$, find the simplest consistent hypothesis $H$ s.t. $B \land H \models E$. In the
case that both $E$ and $H$ are single Horn clauses then by inverse entailment it
is possible to generate the conjunction of all ground literals that are true in all
models of $B \land \overline{E}$, denoted by $\overline{\bot}$ (i.e. $B \land \overline{E} \models \overline{\bot} \models \overline{H}$). In logical back-propagation
examples of an observational predicate are used to augment the definition of a
related theoretical predicate. In the following, examples of sentences (predicates
for s) are used to augment a definition for noun phrase (np).

Example 1. Natural language processing. Given background knowledge

\[ B = \begin{cases} s(A,B) \leftarrow np(A,C), vp(C,D), np(D,B) \\ np(A,B) \leftarrow det(A,C), noun(C,B) \end{cases} \]

example $E = s([the, nasty, man, hit, the, dog], [])$, and a prior hypothesis $H =
np(A,D) \leftarrow det(A,B), adj(B,C), noun(C,D)$ then by inverse entailment

\begin{align*}
\overline{\bot} = {} & \neg s([the, nasty, man, hit, the, dog], []) \\
& \land \neg np([the, nasty, man, hit, the, dog], [hit, the, dog]) \\
& \land det([the, nasty, man, hit, the, dog], [nasty, man, hit, the, dog]) \\
& \land \ldots \land np([the, dog], [])
\end{align*}

The most specific (non-definite) clause that results from variablising terms
(guided by mode declarations) is

\[ \bot = s(A,B);\ np(A,C) \leftarrow det(A,D), adj(D,E), noun(E,C), vp(C,F), np(F,B) \]

The generation of $\bot$ in the above example requires the derivation of negated
literals such as $\neg vp$, which leads to obvious difficulties when using a Horn clause
theorem prover. To overcome this the implementation of logical back-propagation
makes use of mechanisms from Stickel's Prolog Technology Theorem Prover [?].
Clauses are constructed to provide definitions for negated literals. For example,
for the clause $p \leftarrow q$, the contrapositive clause $\neg q \leftarrow \neg p$ is added, allowing the
possibility of $\neg q$ being derived using a Prolog interpreter.

Not all clauses of a theory need have their contrapositive added when
implementing generalised inverse entailment. A relevance map based on the calling
diagram among predicates is used to determine the additional contrapositives
required. The contrapositive required to generate $\bot$ for the example above is
$\neg np(A,C) \leftarrow \neg s(A,B), vp(C,D), np(D,B)$. This enables the derivation of $\neg np$
for the generalisation of $\bot$. The theoretical and observational predicates
involved in the generalisations are communicated by the user to Progol 5.0 by way
of mode declarations.



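C Proof of Theorem 2

Proof. For all $H_i \in \mathcal{H}_P$,

\[ D_T(H_i) = \frac{a}{i^2} \leq \frac{a 2^k}{j^2}, \]

since $j \leq 2^{k/2} i$. So for the learner's ordering over hypotheses:

\[ \sum_{j \geq n+1} D_T(H'_j) \leq \sum_{j \geq n+1} \frac{a 2^k}{j^2} \leq \int_n^\infty \frac{a 2^k}{x^2}\,dx = \frac{a 2^k}{n}. \]

The proof now proceeds in a similar manner to the proof of Theorem 1 given in
[?].

\begin{align*}
EE(m) &= \sum_{T \in \mathcal{H}} D_T(T) \sum_{E \in T^m} D_X(E|T)\, Error(L(E),T) \\
&\leq \sum_{T \in \mathcal{H}'_n} D_T(T) \sum_{E \in T^m} D_X(E|T)\, Error(H'_n(E),T) + \sum_{T \in \mathcal{H} \setminus \mathcal{H}'_n} D_T(T) \cdot 1 \\
&\leq \sum_{T \in \mathcal{H}'_n} D_T(T) \sum_{E \in T^m} D_X(E|T)\, Error(H'_n(E),T) + \frac{a 2^k}{n}
\end{align*}

Let $\tau_{mn}(\epsilon) = \{E : E \in T^m \text{ and } Error(H'_n(E),T) \leq \epsilon\}$. Then:

\begin{align*}
& \sum_{T \in \mathcal{H}'_n} D_T(T) \sum_{E \in T^m} D_X(E|T)\, Error(H'_n(E),T) \\
&= \sum_{T \in \mathcal{H}'_n} D_T(T) \sum_{E \in \tau_{mn}(\epsilon)} D_X(E|T)\, Error(H'_n(E),T) \\
&\quad + \sum_{T \in \mathcal{H}'_n} D_T(T) \sum_{E \in T^m \setminus \tau_{mn}(\epsilon)} D_X(E|T)\, Error(H'_n(E),T) \\
&\leq \epsilon + \Pr(\exists H \in \mathcal{H}'_n : Error(H,T) > \epsilon \text{ and } x_1, \ldots, x_m \in T \cap H) \\
&\leq \epsilon + n e^{-\epsilon m}
\end{align*}

Thus,

\[ EE(m) \leq \epsilon + n e^{-\epsilon m} + \frac{a 2^k}{n}. \]

Optimal values of $n$ and $\epsilon$ are found by successively setting to zero the partial
derivatives with respect to $n$ and $\epsilon$ and solving. This gives $\epsilon = \frac{\ln(nm)}{m}$ and
$n = 2^k a m$. Substituting gives

\[ EE(m) \leq \frac{2 + k\ln 2 + 2\ln m + \ln a}{m} \leq \frac{1.51 + 2\ln m + k\ln 2}{m}. \]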
Abstract. The principles of instance based function learning are
presented. In IBFL one is given a set of positive examples of a functional
predicate. These examples are true ground facts that illustrate the
input-output behaviour of the predicate. The purpose is then to predict
the output of the predicate given a new input. Further assumptions are
that there is no background theory and that the inputs and outputs
of the predicate consist of structured terms. IBFL is a novel technique
that addresses this problem and that combines ideas from instance based
learning, first order distances and analogical or case based reasoning. We
also argue that IBFL is especially useful when there is a need for handling
complex and deeply nested terms. Though we present the technique in
isolation, it might be more useful as a component of a larger system to
deal e.g. with the logic, language and learning challenge.



1 Introduction 

Within the field of machine learning, both instance based learning [1] and induc- 
tive logic programming (or relational learning) [10,9] are important subfields. 
In instance based learning one tries to predict classes of examples by comparing 
these examples to other examples from a training set. Most often there is only a 
finite and small set of discrete classes, and the class can be regarded as a discrete 
attribute of the examples. Instance based learning (and concept-learning) can 
be regarded as function learning, where the function takes as input the example 
description and produces as output the value for the class attribute. 

In recent work within the field of inductive logic programming classical propo- 
sitional instance based learning techniques have been upgraded towards a first 
order framework. A prominent example of this approach is the RIBL system [2]. 
RIBL (and other approaches along this line) take as input a first order descrip- 
tion of an example and predict the value for the corresponding class attribute. 
In terms of function learning, RIBL maps complex inputs (example descriptions 
in first order logic) to simple outputs (the value of the class attribute). From the 
viewpoint of function learning, one may wonder whether it would be possible to 
extend this framework to learn more complex functions that would map complex 
inputs onto complex outputs. Within the framework of computational logic, the 
inputs and outputs would then naturally correspond to deeply structured terms 
(cf. e.g. [4]). The technique of IBFL, presented in this paper, addresses pre- 
cisely this problem. IBFL starts from a set of positive examples of a functional 

S. Dzeroski and P. Flach (Eds.): ILP-99, LNAI 1634, pp. 268-278, 1999. 

© Springer-Verlag Berlin Heidelberg 1999 
predicate p. The examples are thus of the form p(in, out) where in represents a
possibly complex input term and out the corresponding output term. IBFL is
then given a new input term in' for which it has to compute the corresponding
output out' such that p(in', out') holds. IBFL is not given any other information
and thus operates with an empty background theory.

This IBFL setting extends the classical first order instance based learning
framework. It also provides a framework for studying the issues involved in
handling deeply structured terms, which is known to be one of the hard problems
in inductive logic programming (cf. e.g. the Project Programme of the ESPRIT
IV project on Inductive Logic Programming II). The hardness of this problem is
illustrated by the fact that well-known ILP systems are unable to produce good
results on the setting mentioned above due to 1) the combinatorics involved in
handling deeply structured terms, and 2) the fact that the background theory
in IBFL is empty, which makes it necessary to induce recursive hypotheses for
defining p. Recursion (and program synthesis) is one of the other known hard
problems in ILP. Despite the difficulties involved, many potential applications
need to deal with structured terms. One such application is the recent Logic,
Language and Learning challenge (cf. [8]) issued by Kazakov, Pulman and
Muggleton. We hope that the IBFL framework and its techniques will contribute to a
better understanding and possible solutions of these problems. IBFL will
therefore also be illustrated on (simple) tasks that are relevant to the LLL
challenge as well as to program synthesis.

This paper is organized as follows. In section 2, we review some important
aspects of inductive logic programming and instance based learning. In section
3, we state the problem specification. In section 4, we present the base idea of
instance based function learning. In section 5, we extend this method with a
recursive component that makes it possible to learn the translations of parts
of terms. In section 6, we present some experiments, and finally, in section 7,
we give some conclusions and possibilities for further work.



2 Preliminaries 

A substitution $\theta = \{X_1/Y_1, \ldots, X_n/Y_n\}$ is a set of elements
$X_i/Y_i$ such that the $X_i$ are variables and the $Y_i$ are terms. If t is a term
and $\theta$ is a substitution, then we can apply $\theta$ to t: $t\theta$ is the
term t in which each occurrence of $X_i$ is simultaneously replaced by $Y_i$.
E.g. if $\theta$ = {X/f(a), Y/X} and t = f(X, Y, g(X)), then $t\theta$ =
f(f(a), X, g(f(a))). If an inverse substitution $\theta^{-1}$ is applied to a
term s, the result is the set of terms t such that $t\theta = s$. E.g. if
$\theta$ = {X/a, Y/f(a), Z/a} and t = g(f(f(a)), a), then $t\theta^{-1}$ =
{g(f(f(X)), X), g(f(f(X)), Z), g(f(f(Z)), X), g(f(f(Z)), Z), g(f(Y), X),
g(f(Y), Z)}. We will use the notion of least general generalization [12]: a
term $t_1$ subsumes (or is more general than) a term $t_2$ iff there is a
substitution $\theta$ such that $t_1\theta = t_2$. The subsumes relation induces a
partial order on the set of terms. The least upper bound under subsumption is
called the least general generalization (lgg).
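
The simultaneous application of a substitution can be illustrated with a small
Python sketch. The term encoding is an assumption made here for illustration
and is not taken from the paper: a compound term is a tuple (functor, arg1, ...),
a constant is a lower-case string, and a variable is a string starting with an
upper-case letter.

```python
def is_var(term):
    """Variables are strings starting with an upper-case letter (Prolog style)."""
    return isinstance(term, str) and term[:1].isupper()


def apply_subst(term, theta):
    """Return term with every variable replaced simultaneously by its binding."""
    if is_var(term):
        # The binding is inserted as-is and not substituted again.
        return theta.get(term, term)
    if isinstance(term, tuple):
        return (term[0],) + tuple(apply_subst(arg, theta) for arg in term[1:])
    return term  # a constant


# theta = {X/f(a), Y/X} applied to t = f(X, Y, g(X))
theta = {"X": ("f", "a"), "Y": "X"}
t = ("f", "X", "Y", ("g", "X"))
result = apply_subst(t, theta)
```

Because the replacement is simultaneous, the X introduced for Y is not rewritten
to f(a) in turn, which is why the second argument of the result stays X.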
We also use the notion of position as defined in [6]. Positions are sequences
of positive integers (e.g. [2,3,2]). $\epsilon$ denotes the empty position, and
$\cdot$ the concatenation operation on positions. With t a term or atom, the
sub-term of t at position u, written t/u, is defined as follows:

— If t is a term, then $t/\epsilon = t$.

— If $t = f(t_1, \ldots, t_n)$, then $t/(i \cdot u) = t_i/u$.
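
The definition of t/u translates directly into code. The following minimal
sketch assumes (for illustration only) that a compound term is encoded as a
Python tuple (functor, arg1, ..., argn), so that argument i sits at index i.

```python
def subterm(term, position):
    """Return the sub-term of term at the given position (a list of 1-based indices)."""
    for i in position:
        # term[0] is the functor, so the i-th argument is term[i].
        term = term[i]
    return term


# t = a(x, c(d(g, z), e))
t = ("a", "x", ("c", ("d", "g", "z"), "e"))
at_empty = subterm(t, [])          # the empty position returns t itself
at_211 = subterm(t, [2, 1, 1])     # second argument, then first, then first
```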

In instance based learning a distance is needed between the examples. As
we work with structured terms as examples, we must use a distance between
terms such as the distances defined in [11], [7], [13], [14]. In this paper we use
the simple distance defined in [11]:

Definition 1 (distance $d_{nc}$ between terms). If $t_1$ and $t_2$ are terms, then

— if $t_1 = t_2$, then $d_{nc}(t_1, t_2) = 0$.

— if $t_1 = p(x_1, \ldots, x_n)$ and $t_2 = q(y_1, \ldots, y_m)$ with $p \neq q$ or
$n \neq m$, then $d_{nc}(t_1, t_2) = 1$.

— if $t_1 = p(x_1, \ldots, x_n)$ and $t_2 = p(y_1, \ldots, y_n)$, then
$d_{nc}(t_1, t_2) = \frac{1}{2n} \sum_{i=1}^{n} d_{nc}(x_i, y_i)$.

E.g. we can compute

$$
\begin{aligned}
d_{nc}(p(f(a), g(h(c), d)),\ p(f(b), h(e, d)))
&= \tfrac{1}{4}\,\big(d_{nc}(f(a), f(b)) + d_{nc}(g(h(c), d), h(e, d))\big) \\
&= \tfrac{1}{4}\,\big(\tfrac{1}{2}\, d_{nc}(a, b) + 1\big)
 = \tfrac{1}{4}\,\big(\tfrac{1}{2} \cdot 1 + 1\big) = \tfrac{3}{8}
\end{aligned}
$$

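
The recursive definition above can be checked mechanically. The sketch below
is an illustration only; it assumes that compound terms are encoded as tuples
(functor, arg1, ...) and constants as plain strings, and it uses exact
fractions so that the worked example can be compared without rounding.

```python
from fractions import Fraction


def split(term):
    """Return (functor, arguments) for a constant (str) or compound term (tuple)."""
    if isinstance(term, tuple):
        return term[0], term[1:]
    return term, ()


def d_nc(t1, t2):
    """Distance between two ground terms following Definition 1."""
    if t1 == t2:
        return Fraction(0)
    f1, args1 = split(t1)
    f2, args2 = split(t2)
    if f1 != f2 or len(args1) != len(args2):
        return Fraction(1)
    n = len(args1)  # n > 0 here, since equal constants were handled above
    return sum(d_nc(x, y) for x, y in zip(args1, args2)) / (2 * n)


# The worked example: p(f(a), g(h(c), d)) versus p(f(b), h(e, d))
t1 = ("p", ("f", "a"), ("g", ("h", "c"), "d"))
t2 = ("p", ("f", "b"), ("h", "e", "d"))
example_distance = d_nc(t1, t2)
```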

3 Problem specification 

In this section we state the problem we want to solve. The unknown target 
predicate represents a function. This means that the examples are ground facts 
about the input-output behaviour of the predicate. In terms of ILP we have the 
following problem setting: 

Given is: 

— an unknown functional target predicate p. 

— a set of positive examples (ground facts) for p. In the following we will denote
the training examples with $E_i = p(E_{i,in}, E_{i,out})$ with $E_{i,in}$ the input term
and $E_{i,out}$ the output term.

— an empty background theory.

— an input term $N_{in}$.

Find:

— an unknown output $N_{out}$ such that $p(N_{in}, N_{out})$.

Suppose e.g. we have an example p(in(circle, square), in(square, circle)),
and have to find the output of in(circle, triangle) (see figure 1). The program
could then answer in(triangle, circle). This is also an example of learning by
analogy (cf. [3]).
4 Instance based function learning

In the attribute-value setting, the process of predicting a class for an example
with the k-nearest neighbours method can be divided into three important steps.
First, the example input is compared, using a distance, with all the training
example inputs and the k nearest examples are selected. Second, from each of
these k nearest examples the class information is extracted. Third, from these k
classes the most frequent class is selected and used as the prediction for the
example.

Predicting a whole term at once is difficult. For this reason we will only
predict one functor at a time. Algorithm 1 is the basic algorithm for this.

We start with a variable (a completely uninstantiated term) as predicted output
and we repeatedly apply a substitution of the form $Var/f(Var_1, \ldots, Var_n)$
to add functors (with variable arguments) until we obtain a ground term. In
practice, since it is not always guaranteed that all positions will eventually
become ground (by filling them with functors with arity 0 (constants)), a stopping
criterion could be used. A good stopping criterion is restricting the maximum
number of functors in the predicted output term.

Functors are added using the instance based learning component whose three
steps we discuss below. We know that algorithm 1 provides a partially
instantiated term $N_{out}$ and a position pos such that the sub-term $N_{out}/pos$
at position pos of $N_{out}$ is a variable. Thus the task of this instance based
learning component is to (partially) instantiate $N_{out}/pos$, so we get a better
approximation of our prediction of $N_{out}$.

First we need to measure the distance between the new example and the 
training examples. We can use for this the distance dnc given in section 2. 

After measuring the distances between the input of the new example and 
the inputs of all training examples, we can select the k nearest relevant training 
examples. Whether an example is relevant or not is determined by the possibility 
of the second step to extract useful information (functors) from that example, 
as will soon become clear. 

In the second step, we must extract information about which functor to 
predict from the k nearest (relevant) examples. The way this is done is important 
as this causes the method to be able to transform the structure of terms. 

Suppose we must extract information from training example $E_i$. First the
least general generalization under subsumption $G_{in} = lgg(E_{i,in}, N_{in})$ of the
input terms $E_{i,in}$ and $N_{in}$ is computed, together with substitutions $\theta_e$
and $\theta_n$ such that $G_{in}\theta_e = E_{i,in}$ and $G_{in}\theta_n = N_{in}$ (see also
figure 2). Then, we consider the set $S = E_{i,out}\theta_e^{-1}\theta_n$. Because of the
inverse substitution this set can contain several elements. Let then $S_1$ be
the subset of S of all terms t of S such that the functors along the path from
the top functor to pos of t and $N_{out}$ are identical. E.g. if pos = [2,1,1] and
$N_{out}$ = a(_, c(d(_, g), e)) and S = {a(x, c(d(g, z), e)), a(b, q(d(h, g), e))},
then $S_1$ = {a(x, c(d(g, z), e))} (see fig. 3).

Fig. 1. An example where IBFL could be used to learn by analogy




Fig. 2. Step 2 of the instance based learning process 



Let then $S_2 = \{t/pos \mid t \in S_1\}$. In the above example with pos = [2,1,1],
$S_1$ = {a(x, c(d(g, z), e))}, $S_2$ = {g}. Note that if t/pos is a variable (which will
be possible after the extensions of section 5), this does not add functors to $S_2$.
If $S_2$ is not empty, then the example is relevant.

So for each relevant example a multi-set of functors is obtained. In the third
step the most frequently occurring functor is then predicted.

We illustrate this with a small example. We suppose 1-nearest neighbours is 
used for simplicity. 

Example 1. Suppose we have the following training examples:

p(swap(a,b), swapped(b,a)).
p(dontswap(c,d), notswapped(c,d)).

We want to know the output of swap(g, h). First set the prediction $N_{out}$ = Var.
We first measure the distances $d_{nc}$(swap(g, h), swap(a, b)) = 1/2 and
$d_{nc}$(swap(g, h), dontswap(c, d)) = 1. So p(swap(a, b), swapped(b, a)) is the
nearest example. Then the lgg is $G_{in}$ = swap(X, Y) and $\theta_e$ = {X/a, Y/b} and
$\theta_n$ = {X/g, Y/h}. Next, swapped(b, a)$\theta_e^{-1}\theta_n$ =
swapped(Y, X)$\theta_n$ = swapped(h, g). The main functor of $N_{out}$ is predicted to
be swapped/2 and we get a more instantiated prediction $N_{out}$ = swapped(_, _).
This is not yet ground: in swapped(_, _) positions [1] and [2] are still
variables. For this, select [1] as the position to be instantiated, again select
the nearest training example, and recompute $G_{in}$, $\theta_e$, $\theta_n$, which in
this case all obtain the same result. Again swapped(b, a)$\theta_e^{-1}\theta_n$ =
swapped(h, g), so we predict $N_{out}$/[1] = h and $N_{out}$ = swapped(h, _). Iterating
a third time we get $N_{out}$ = swapped(h, g). This is fully instantiated, so we
predict swapped(h, g) as the output term.

Algorithm 1 Base iteration

predict_term($N_{in}$)
    Let $N_{out}$ = A_New_Variable,
    while not(ground($N_{out}$)) and not(stoppingcriterion) do
        Let Pos be a position such that V = $N_{out}$/Pos is a variable,
        predict_best_functor_at_pos($N_{in}$, $N_{out}$, Pos, Functor/Arity),
        V <- V$\theta$ with $\theta$ = {V/Functor($Newvar_1$, ..., $Newvar_{Arity}$)}
    return $N_{out}$

Fig. 3. A and B have equal functors along the path to [2,1,1]
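
The extraction step used in Example 1 (lgg of the inputs, inverse substitution
on the training output, then the substitution for the new input) can be
sketched in Python. This is an illustrative reading of the text, not the
authors' implementation: the tuple encoding of terms, the generated variable
names V0, V1, ... and the helper names lgg, inverse_subst and transfer are all
assumptions introduced here. Following the example in section 2, the inverse
substitution replaces every occurrence of a bound term by one of its variables.

```python
from itertools import product


def is_var(term):
    return isinstance(term, str) and term[:1].isupper()


def apply_subst(term, theta):
    if is_var(term):
        return theta.get(term, term)
    if isinstance(term, tuple):
        return (term[0],) + tuple(apply_subst(a, theta) for a in term[1:])
    return term


def lgg(s, t, table):
    """Anti-unify s and t; table maps pairs of differing sub-terms to variables."""
    if s == t:
        return s
    if (isinstance(s, tuple) and isinstance(t, tuple)
            and s[0] == t[0] and len(s) == len(t)):
        return (s[0],) + tuple(lgg(a, b, table) for a, b in zip(s[1:], t[1:]))
    if (s, t) not in table:
        table[(s, t)] = "V%d" % len(table)
    return table[(s, t)]


def inverse_subst(term, theta):
    """All terms obtained by replacing each bound sub-term by one of its variables."""
    here = [var for var, bound in theta.items() if bound == term]
    if isinstance(term, tuple) and len(term) > 1:
        options = [inverse_subst(arg, theta) for arg in term[1:]]
        below = [(term[0],) + combo for combo in product(*options)]
    else:
        below = [term]
    # If term itself is a binding, do not keep an unreplaced copy of it.
    return here + [u for u in below if not (here and u == term)]


def transfer(example_in, example_out, new_in):
    """Return (G_in, S) where S is example_out theta_e^-1 theta_n."""
    table = {}
    g_in = lgg(example_in, new_in, table)
    theta_e = {var: pair[0] for pair, var in table.items()}
    theta_n = {var: pair[1] for pair, var in table.items()}
    s = [apply_subst(t, theta_n) for t in inverse_subst(example_out, theta_e)]
    return g_in, s


# Example 1: p(swap(a,b), swapped(b,a)) and the new input swap(g,h)
g_in, s = transfer(("swap", "a", "b"), ("swapped", "b", "a"), ("swap", "g", "h"))
```

The functor to predict at a position pos is then read off the elements of S
whose functors agree with the current prediction along the path to pos.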

5 Recursive IBFL 

The method described in the previous section works fine when the main problem 
is to transform structure. However, we also want to do more complex things. It 
can happen that too many examples become irrelevant as the predicted output 
term becomes more and more instantiated. Consider the following training examples:

Example 2.

p(dog, chien).    p(house, maison).    p(near, pres).
p(cat, chat).     p(school, ecole).    p(in, dans).
p(place(near, school), lieu(pres, ecole)).
p(walks(dog, place(in, school)), court(lieu(dans, ecole), chien)).
p(walks(cat, place(near, school)), court(lieu(pres, ecole), chat)).
p(walks(cat, place(in, house)), court(lieu(dans, maison), chat)).
p(sleeps(cat, place(near, school)), dort(lieu(pres, ecole), chat)).

This example concerns a small translation problem. We would like to predict
(i.e. translate) the output of sleeps(dog, place(in, house)). However, if we
apply the method from the previous section, at some point we get $N_{out}$ =
dort(lieu(_, _), _) and pos = [1,2]. Now there are no good and relevant
examples: e.g. for p(walks(dog, place(in, school)), court(lieu(dans, ecole), chien)),
we get G = X, $\theta_e$ = {X/walks(dog, place(in, school))},
$\theta_n$ = {X/sleeps(dog, place(in, house))} and
court(lieu(dans, ecole), chien)$\theta_e^{-1}\theta_n$ = court(lieu(dans, ecole), chien).
The main functor is here court while the main functor of $N_{out}$ is dort, so we
can't use this to predict the first argument of dort. On the other hand, for
the example p(sleeps(cat, place(in, school)), dort(lieu(dans, ecole), chat)), we
have G = sleeps(X, place(in, Y)), $\theta_e$ = {X/cat, Y/school}, $\theta_n$ =
{X/dog, Y/house}, and dort(lieu(dans, ecole), chien)$\theta_e^{-1}\theta_n$ =
dort(lieu(dans, ecole), chien), and so ecole would be predicted.

This is not what is intended, as we can intuitively see that the result of the
prediction should optimally be dort(lieu(dans, maison), chien).

One solution to this problem is to allow the system to learn from the solution
of subproblems (subterms). In the above example, this would mean that we let
the system use the 'lexicon' that gives the translations for sub-terms.

To achieve this, we will, before using the instance based learning system of
the previous section, add information to the input terms of the examples. An
example is given in figure 4. This causes the substitutions that are essential in
the prediction process to contain as much information as possible. The algorithm
now translates all the subterms of the input term before translating the input
term. The translation thus proceeds in a bottom-up fashion, starting at the
leaves of the tree and gradually working on larger terms.




Fig. 4. Additional information for the input term to improve the substitutions 



Example 3. What is the effect on example 2?

We can now translate sleeps(dog, place(in, house)): First, the input term
sleeps(dog, place(in, house)) is extended with info on the sub-terms: we
therefore first need to solve two other problems: Given input term dog, what is the
output term? Given input term place(in, house), what is the output term?
We have a training example which gives the translation for dog. We translate
place(in, house) recursively: we have training examples for p(in, dans) and
p(house, maison), so we get p(place(in, house), lieu(dans, maison)). So in our
original problem, the extended input term is (sleeps(dog, house),
sleeps(λ(chien, dog), λ(lieu(dans, maison), place(λ(dans, in), λ(maison, house))))).
Now the instance based algorithm is applied. In the first step we get as
preliminary prediction dort(_, _). Suppose we now fill in the first argument.
p(sleeps(cat, place(in, house)), dort(lieu(dans, maison), chat)) is a near and
relevant example for this. We extend the input term of this training example.
We get (sleeps(cat, place(in, house)), sleeps(λ(chat, cat), λ(lieu(dans, ecole),
place(λ(dans, in), λ(maison, house))))). We get as generalization of the input
terms G = (sleeps(X, house), sleeps(λ(V, X), λ(lieu(dans, maison),
place(λ(dans, in), λ(maison, house))))) with $\theta_e$ = {X/cat, V/chat} and
$\theta_n$ = {X/dog, V/chien}. Finally, applying these same substitutions to
dort(lieu(dans, ecole), chat), we get dort(lieu(dans, ecole), chat)$\theta_e^{-1}\theta_n$
= dort(lieu(dans, W), V)$\theta_n$ = dort(lieu(dans, maison), chien) which
was the correct translation.

This kind of experiment is difficult for classical ILP systems such as Progol.
We tried to run this example with Progol but ran into problems of too large a
search space due to the exponential number of terms (in the size of the examples)
in the bottom clause^1.

6 Experiments 

We implemented the method described in the previous sections. In this section, 
we summarize the result of some experiments. 

Experiment 1 Another application where IBFL could be useful is the learning
of parsers. E.g. given an expression such as +(2, *(4, 5)), we would like to be
able to execute it (compute the result) on a stack-machine. Therefore we need a
program such as [push(2), push(4), push(5), *, +] which computes the expression.
The IBFL system can learn to do this conversion. Given the following examples:

p(3+(4+2), [push(3), [push(4), push(2), +], +]).
p(5+6, [push(5), push(6), +]).
p(8+9, [push(8), push(9), +]).
p(11, push(11)).

The system can correctly predict

p((11+12)+(13+14), [[push(11), push(12), +], [push(13), push(14), +], +]).

Then after removing the redundant brackets in the output we obtain the desired
program. This is possible independently of the complexity of the expressions to
convert, as long as sufficient training examples are given, including simple cases
(from which the system can learn most unambiguously).
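
Removing the redundant brackets and executing the resulting program can be
sketched as follows. This post-processing is not part of IBFL itself; the
encoding of instructions as ("push", n) tuples and operator strings, and the
function names flatten and run, are assumptions made for this illustration.

```python
def flatten(program):
    """Remove nested brackets from a predicted stack-machine program."""
    if not isinstance(program, list):
        return [program]
    flat = []
    for item in program:
        flat.extend(flatten(item))
    return flat


def run(program):
    """Execute a flat program of ("push", n) instructions and "+" / "*" operators."""
    stack = []
    for instr in program:
        if isinstance(instr, tuple):        # ("push", n)
            stack.append(instr[1])
        else:                               # binary operator
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if instr == "+" else left * right)
    return stack[-1]


# The predicted output for (11+12)+(13+14), with its nested brackets
predicted = [[("push", 11), ("push", 12), "+"],
             [("push", 13), ("push", 14), "+"], "+"]
flat = flatten(predicted)
value = run(flat)
```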

Experiment 2 We want to learn the induction step of the predicate reverse
(see algorithm 2). We can learn this from partial execution traces. The system
was given a set of training examples:

p(reverse([c,d,e], [b,a], Result), reverse([d,e], [c,b,a], Result)).
p(reverse([a,b], [], Result), reverse([b], [a], Result)).
p(reverse([], [z,y,x], Result), Result=[z,y,x]).
p(reverse([r,s,t,u,v], [q,p], Result),
  reverse([s,t,u,v], [r,q,p], Result)).

^1 James Cussens, personal communication

The p predicate has as first (input) argument the predicate to be executed.
The second (output) argument represents what must be done for that. There are
two kinds of execution steps. If the output term of p is a simple operation (a
unification of the output term with the correct result), this can be done at once.
This is the case if the first argument in the call of reverse is []. If the output
term of p is a new call to reverse, we can again execute this call by predicting
the output of p when this new call is given as input, and so on.

The inputs of some other steps were given. The system predicted the following
outputs correctly:

p(reverse([f,g,h,i], [e,d,c], Result),
  reverse([g,h,i], [f,e,d,c], Result)).
p(reverse([x,y,z], [], Result), reverse([y,z], [x], Result)).
p(reverse([], [z,y,x,w], Result), Result=[z,y,x,w]).

The following observations can be made. First, the system could correctly
distinguish between the base case (where the first argument is []) and other cases.
Second, the only test examples whose output term was predicted incorrectly were
the examples with a first argument at least as large as the largest first argument
in the training examples, e.g.

p(reverse([r,s,t,u,v], [q], Result), reverse([s,t,u,[]], [r,q], Result))

One can conclude from this that the system can learn the induction step of the
reverse predicate, but that further work could make it possible to generalize to
larger inputs.

When an induction step is learned, this induction step can be applied several
times (i.e. the predicted output can be used as input of the next step). If we do
this until the result does not contain any further call to reverse, we can compute
for all inputs the result of the reverse predicate. This means we can learn the
reverse predicate from partial execution traces.
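
The repeated application of a single step can be sketched as a small driver
loop. In the sketch below the function step is hand-written and merely stands
in for the learned predictor (an assumption made so that the example is
self-contained); a call reverse(Rest, Acc, Result) is encoded as the pair
(rest, acc).

```python
def step(call):
    """One execution step of reverse, mirroring the shape of the training traces.

    Returns ("call", (rest, acc)) for a new call to reverse, or
    ("result", acc) when Result can be unified with the accumulator.
    """
    rest, acc = call
    if not rest:                       # base case: first argument is []
        return ("result", acc)
    return ("call", (rest[1:], [rest[0]] + acc))


def run_reverse(items):
    """Apply the step until the output no longer contains a call to reverse."""
    state = ("call", (list(items), []))
    while state[0] == "call":
        state = step(state[1])
    return state[1]


reversed_list = run_reverse(["x", "y", "z"])
```

Replacing step by the prediction of the instance based learner gives the
iteration described in the text.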

Algorithm 2 Reverse

reverse([X|Y], Z, Result) :-
    reverse(Y, [X|Z], Result).
reverse([], Z, Result) :- Z = Result.

7 Conclusions

We have presented an approach that extends instance based learning to the
learning of complex functions from terms to terms. We have argued that IBFL is
relevant to the problem of handling deeply structured terms and recursion, and
we have illustrated the potential use of IBFL in natural language applications
and program synthesis.

We expect that the approach can best be used as a component of a larger 
system that might use IBFL to compute e.g. the output structure of a clause in 
a program synthesis application. One current limitation of the technique is its 
inability to incorporate background knowledge. It may be possible to incorpo- 
rate background knowledge in the process in similar ways as in RIBL and the 
framework by Flach et al [4]. This framework encodes examples (and background 
theory) as structured terms and may therefore be well-suited for IBFL. 

IBFL is related to the instance based learning work in ILP (see [1]); it can also
be considered a form of analogical reasoning (see [3], [5], [15]). Indeed, in analogical
and case based reasoning one maps a target problem onto a target solution 
by using similarities with a source problem with known solution. In IBFL the 
target problem is the new input term, and the source problems and solutions 
correspond to the examples of the input-output behaviour of the function. 
Abstract. This paper studies the properties of inverse resolution in normal
logic programs. The V-operators are known as operations for inductive
generalization in definite logic programs. In the presence of negation
as failure in a program, however, the V-operators do not work as
generalization operations in general and often make a consistent program
inconsistent. Moreover, they may destroy the syntactic structure of logic
programs such as acyclicity and local stratification. On the procedural
side, unrestricted application of the V-operators may lose answers
computed in the original program and make queries flounder. We provide
sufficient conditions for the V-operators to avoid these problems.



1 Introduction 

Inverse resolution introduced in H2I is known as operations which perform in- 
ductive generalization in definite logic programs. There are two operators that 
carry out inverse resolution, absorption and identification, which are called the 
V-operators together. Each operator builds one of the two parent clauses given 
the other parent clause and the resolvent. More precisely, absorption constructs
$C_2$ from $C_1$ and $C_3$, while identification constructs $C_1$ from $C_2$ and $C_3$ in
the figure (where lower-case letters are atoms and upper-case letters are
conjunctions of atoms).

$C_1 : q \leftarrow A$        $C_2 : p \leftarrow q, B$

            $C_3 : p \leftarrow A, B$





Absorption and identification are realized in Duce [I ill I j for propositional 
Horn theories, and a restricted version of absorption is implemented in CIGOL 
for predicate Horn theories. 



S. Dzeroski and P. Flach (Eds.): ILP-99, LNAI 1634, pp. 279-|29^ 1999. 
The V-operators are considered as program transformation rules. That is,
absorption transforms the set of clauses { $C_1$, $C_3$ } to { $C_1$, $C_2$ }, while
identification transforms { $C_2$, $C_3$ } to { $C_1$, $C_2$ }. In logic programming,
program transformation is an important technique for program development,
especially for deriving an efficient program preserving the meaning of the
original program. In the context of inductive logic programming (ILP), program
transformation is used for a different purpose. In ILP the original program
represents an imperfect background theory, and it is transformed to a more
general/specific program which covers given positive/negative evidences. The
V-operators are used as program transformation rules which generalize definite
Horn logic programs.

When a program is nonmonotonic, however, the behavior of the V-operators 
is not clear. The importance of nonmonotonic reasoning in commonsense in- 
ference is widely recognized, and many studies have been done to formalize 
nonmonotonic reasoning in logic programming [3. In logic programming, non- 
monotonic reasoning is realized using negation as failure, and a program with 
negation as failure is called a normal logic program. Then, our primary interest 
is the semantic nature of the V-operators in normal logic programs. 

This paper investigates the properties of inverse resolution in normal logic 
programs. In the presence of negation as failure, we show that the V-operators 
do not work as generalization operators in general and often make a consistent 
program inconsistent. Moreover, the V-operators destroy the structures of logic 
programs such as acyclicity and local stratification. On the procedural side, it is 
shown that unrestricted application of the V-operators may lose answers com- 
puted in the original program and make queries flounder. We provide sufficient 
conditions for the V-operators to avoid these problems. 

The paper is organized as follows. Section 2 reviews the logic programming 
framework considered in this paper. Section 3 shows the declarative properties 
of inverse resolution in normal logic programs. Section 4 argues the effects of the 
V-operators on query-answering. Section 5 presents related issues and Section 6 
concludes the paper. 

2 Normal Logic Programs 

A normal logic program is a set of rules of the form: 

$$p \leftarrow q_1, \ldots, q_m, \neg q_{m+1}, \ldots, \neg q_n \qquad (1)$$

where p and $q_i$ ($1 \leq i \leq n$) are atoms and $\neg$ represents negation as failure.
Throughout the paper a program means a normal logic program unless stated 
otherwise. The left-hand side of the rule is the head, and the right-hand side is the 
body. A rule with an empty body is a fact. A program P is definite if no rule in P 
contains negation as failure. A program, a rule or an atom is ground if it contains 
no variable. A program P is semantically identified with its ground instantiation, 
i.e., the set of ground rules obtained from P by substituting variables in P by 
elements of its Herbrand universe in every possible way. 
A partial order relation $\geq$ is defined over the Herbrand base HB of a program
such that: if $p \geq q$ then p is in a level higher than or equal to q. A program
P is locally stratified [13] if (i) for any ground rule of the form (1) from P,
$p \geq q_i$ ($i = 1, \ldots, m$) and $p > q_j$ ($j = m+1, \ldots, n$); and (ii) there is
no infinite sequence such as $p_1 > p_2 > \cdots$. Here, $p > q$ if $p \geq q$ and
$q \not\geq p$. A program P is acyclic [2] if (i) for any ground rule of the form
(1) from P, $p > q_i$ ($i = 1, \ldots, n$); and (ii) the same condition as above.
By definition, the class of acyclic programs is strictly included in the class
of locally stratified programs.

An interpretation $I$ ($\subseteq HB$) satisfies the conjunction
$C = q_1, \ldots, q_m, \neg q_{m+1}, \ldots, \neg q_n$ if $\{q_1, \ldots, q_m\} \subseteq I$ and
$\{q_{m+1}, \ldots, q_n\} \cap I = \emptyset$ (written as $I \models C$). $I$ satisfies the
rule (1) if $\{q_1, \ldots, q_m\} \subseteq I$ and $\{q_{m+1}, \ldots, q_n\} \cap I = \emptyset$
imply $p \in I$. An interpretation $I$ which satisfies every rule in a program P is
a model of the program (written as $I \models P$). A model $I$ of a program P is
called supported Q if for each atom p in $I$, there is a ground rule (1) from P
such that $I$ satisfies its body. A model $I$ of P is minimal if there is no model
$J$ of P such that $J \subset I$. A definite program P has the unique minimal model
which is the least model (denoted by $LM_P$).

For the semantics of normal logic programs, we consider the stable model
semantics of |H1 and Clark's completion jjj. Given a program P and an
interpretation $I$, the ground definite program $P^I$ is defined as follows: a ground
rule $p \leftarrow q_1, \ldots, q_m$ is in $P^I$ iff there is a ground rule of the form
(1) in the ground instantiation of P such that $\{q_{m+1}, \ldots, q_n\} \cap I = \emptyset$.
If the least model of $P^I$ is identical to $I$, $I$ is called a stable model of P.
A program may have none, one, or multiple stable models in general. In a
definite program, a stable model coincides with the least model. A locally
stratified program has the unique stable model which is called the perfect
model. A program is consistent (under the stable model semantics) if it has a
stable model; otherwise it is inconsistent. For a consistent program P and an
atom a, if a is included in every stable model of P, it is written as
$P \models a$; otherwise $P \not\models a$.
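
The stable model test described above can be written down directly for ground
programs. The following is a minimal sketch under assumptions made for this
illustration: a rule is a triple (head, positive_body, negative_body) of an
atom name and two tuples of atom names, and the function names reduct,
least_model and is_stable are introduced here.

```python
def reduct(program, interpretation):
    """Build the definite program P^I: drop rules blocked by I, strip negation."""
    return [(head, pos) for head, pos, neg in program
            if not (set(neg) & interpretation)]


def least_model(definite_program):
    """Least model of a ground definite program by fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in definite_program:
            if head not in model and set(pos) <= model:
                model.add(head)
                changed = True
    return model


def is_stable(program, interpretation):
    """I is a stable model of P iff the least model of P^I equals I."""
    interpretation = set(interpretation)
    return least_model(reduct(program, interpretation)) == interpretation


# p <- not q.   q <- not p.   (two stable models: {p} and {q})
two_models = [("p", (), ("q",)), ("q", (), ("p",))]
# p <- not p.   (no stable model, hence inconsistent)
no_model = [("p", (), ("p",))]
```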

On the other hand, suppose that a ground program P has k rules
$p \leftarrow B_1; \ldots; p \leftarrow B_k$ defining the predicate p. Then, the completion
of a program P (written as comp(P)) includes the first-order formula
$p \leftrightarrow B_1 \vee \cdots \vee B_k$.^1 In particular, when P has no definition of p,
$p \leftrightarrow false$ is in comp(P). We say that an interpretation $I$ is a completion
model of P if $I$ is a minimal model of comp(P). The (in)consistency of a program
under the completion semantics is defined in the same way as the stable model
semantics.



3 Declarative Properties 

This section investigates the declarative properties of inverse resolution, and 
programs and rules are assumed to be ground. 

^1 In this paper we consider the completion of a ground program. $\neg$ is
interpreted as classical negation in comp(P).
3.1 Semantic Properties

Absorption and identification are operations such that

Absorption:

    Input :  $C_1 : q \leftarrow A$ and $C_3 : p \leftarrow A, B$        (2)
    Output : $C_2 : p \leftarrow q, B$ and $C_1$

Identification:

    Input :  $C_2 : p \leftarrow q, B$ and $C_3 : p \leftarrow A, B$     (3)
    Output : $C_1 : q \leftarrow A$ and $C_2$

where p and q are atoms, and A and B are conjunctions in the body. Throughout
the paper, we use the symbols $C_1$, $C_2$, and $C_3$ which refer to rules of the
above forms. Absorption and identification are called the V-operators together.
Note that in H31 these operations are introduced to definite programs. Here we 
consider the V-operators in normal logic programs. 

In the ILP literature, there are two cases in the usage of these operations.
The first case is that the input rule $C_1$ in absorption or $C_2$ in identification is
included in the background theory P, while the input rule $C_3$ is given as an
example aside from P (e.g. [ 1 31 1 til l 7j 1). In this case, the output rule
$C_2$ in (2) or $C_1$ in (3) is just added to the original theory P. The second case
is that the input rule $C_3$ is also included in the background theory P as well as
the input rule $C_1$ or $C_2$ (e.g. |3IU)I1 1) 1). In this case, the input rule $C_3$ in
P is replaced by the output rule $C_2$ in (2) or $C_1$ in (3).^2

In this section, the distinction of two cases is not important. In fact, the
properties presented in Sections 3.1 and 3.2 hold in either case. Then, we do
not distinguish the locations of the input rules and consider the second case in
this section.^3 Given a program P containing the rules $C_1$ and $C_3$, absorption
produces the program A(P) such that

$$A(P) = (P \setminus \{C_3\}) \cup \{C_2\}.$$

On the other hand, given a program P containing the rules $C_2$ and $C_3$,
identification produces the program I(P) such that

$$I(P) = (P \setminus \{C_3\}) \cup \{C_1\}.$$

Note that multiple A(P) or I(P) exist in general according to the
choice of the input rules in P.

For notational convenience, we use V{P) which means either A{P) or I{P). 
With this setting, absorption (resp. identification) is captured as a program 
transformation from P to A{P) (resp. I{P))- Then, we first investigate semanti- 
cal relations between the original program P and the produced program V{P). 
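The two transformations can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the original paper: ground rules are modelled as a head plus a set of body literals, the names Rule, rule, absorption and identification are hypothetical, and the conjunctions A and B are assumed to share no literal. The two programs are those of Example 3.1 later in this section.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    head: str
    body: FrozenSet[str]  # ground body literals, e.g. "q" or "not q"


def rule(head, *body):
    """Convenience constructor (hypothetical helper)."""
    return Rule(head, frozenset(body))


def absorption(program, c1, c3):
    # C1: q <- A and C3: p <- A, B yield C2: p <- q, B.
    # Assumes A and B are disjoint sets of literals.
    if c1 not in program or c3 not in program or not c1.body <= c3.body:
        raise ValueError("rules do not match the absorption schema")
    c2 = Rule(c3.head, (c3.body - c1.body) | {c1.head})
    return (program - {c3}) | {c2}  # A(P) = (P minus {C3}) union {C2}


def identification(program, c2, c3, q):
    # C2: p <- q, B and C3: p <- A, B yield C1: q <- A.
    b = c2.body - {q}
    if (c2 not in program or c3 not in program or q not in c2.body
            or c2.head != c3.head or not b <= c3.body):
        raise ValueError("rules do not match the identification schema")
    c1 = Rule(q, c3.body - b)
    return (program - {c3}) | {c1}  # I(P) = (P minus {C3}) union {C1}


# Programs P1 and P2 of Example 3.1.
P1 = frozenset({rule("p", "q"), rule("r", "q"), rule("r")})
A_P1 = absorption(P1, rule("r", "q"), rule("p", "q"))

P2 = frozenset({rule("p", "q", "r"), rule("p", "s", "r"), rule("s")})
I_P2 = identification(P2, rule("p", "q", "r"), rule("p", "s", "r"), "q")
```

Because the result depends on which rules are passed as inputs, different calls on the same program can return different A(P) or I(P), which mirrors the remark that these programs are not unique.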

^ C3 is derived from C1 and C2, hence it is redundant. 

® The distinction between the two cases matters in Section 4, where we return to 
this point. 







Proposition 3.1. Let P be a normal logic program. If M is a stable model of 
V(P), M is a model of P. 

Proof. When M is a stable model of V(P), M satisfies C1 and C2 in V(P). If 
M ⊭ A, M satisfies C3. Else if M ⊨ A, q ∈ M by C1. As M satisfies C2, M ⊨ B 
implies p ∈ M. Hence, M satisfies C3. Therefore, M is a model of P. □ 

Corollary 3.2 Let P be a definite program and LM_V(P) the least model of V(P). 
Then, LM_P ⊆ LM_V(P) ⊆ HB. 

Proof. Since LM_V(P) is a model of P by Proposition 3.1, the result follows 
immediately. □ 

In the above corollary, when LM_V(P) \ LM_P ≠ ∅, any atom p ∈ LM_V(P) \ LM_P 
is often called an inductive leap in the literature. 

Proposition 3.1 states that any stable model M of the produced program 
V(P) satisfies the original program P (i.e., M ⊨ P). Clearly, a stable model M 
of V(P) is not necessarily a stable model of P. Indeed, M is neither minimal 
nor supported in P in general. 

Proposition 3.3. A stable model M of V(P) is generally neither minimal nor 
supported in P. 

Example 3.1. Let P1 be the program 

p ← q,   r ← q,   r ←. 

Using the first rule and the second rule, absorption produces A(P1): 

p ← r,   r ← q,   r ←. 

Here, A(P1) has the stable model {p, r}, which is neither minimal nor 
supported in P1. Next, let P2 be the program 

p ← q, r,   p ← s, r,   s ←. 

Using the first rule and the second rule, identification produces I(P2): 

p ← q, r,   q ← s,   s ←. 

Here, I(P2) has the stable model {q, s}, which is neither minimal nor supported 
in P2. 

The V-operators may increase the set of proven facts, hence a stable model of V(P) is 
not a minimal model of P in general. Moreover, absorption generalizes 
the condition of a rule, so that the body of the original rule need not 
be true for the head of the rule to be implied in A(P). Also, identification may 
produce a rule whose head is an atom that does not appear in the head 
of any rule in the original program. When a stable model of I(P) contains such 
an atom, it may not be supported in P. 

Next we consider the completion semantics. It is shown that a completion 
model of V{P) is a model of P. 
Proposition 3.4. Let P be a normal logic program. When M is a completion 
model of V(P), M is a model of P. 

Proof. Suppose that C1 and C2 are respectively completed as C1' : q ↔ A ∨ Γ1 
and C2' : p ↔ (q, B) ∨ Γ2 in V(P), where Γ1 and Γ2 are formulas in disjunctive 
normal form. When M is a completion model of V(P), M satisfies C1' and C2'. 
Then, M satisfies p ↔ ((A ∨ Γ1) ∧ B) ∨ Γ2 = p ↔ (A ∧ B) ∨ (Γ1 ∧ B) ∨ Γ2. Hence, 
M satisfies C3 : p ← A, B. Therefore, M is a model of P. □ 

A completion model of V(P) is not necessarily a completion model of P, and 
vice versa. 

Example 3.2. Let P be the program 

p ← r, s,   q ← r,   q ← t,   s ←,   t ←. 

Then comp(P) becomes 

p ↔ r, s,   q ↔ r ∨ t,   r ↔ false,   s ↔ true,   t ↔ true, 

which has the completion model {q, s, t}. On the other hand, using the first rule 
and the second rule in P, absorption produces A(P): 

p ← q, s,   q ← r,   q ← t,   s ←,   t ←. 

Then comp(V(P)) becomes 

p ↔ q, s,   q ↔ r ∨ t,   r ↔ false,   s ↔ true,   t ↔ true, 

which has the completion model {p, q, s, t}. 
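For small ground programs, the completion models in Example 3.2 can be checked by brute force. The sketch below is an illustration only: it assumes rules are encoded as (head, positive body, negative body) triples, and the function name completion_models is hypothetical. An interpretation is accepted when each atom is true exactly when some rule body for it holds.

```python
from itertools import product


def completion_models(program):
    """Enumerate models of comp(P) for a finite ground program.

    Each rule is (head, positive_body, negative_body) with sets of atoms.
    An atom without any rule is completed to 'atom <-> false'.
    """
    atoms = sorted({h for h, _, _ in program}
                   | {a for _, pos, neg in program for a in pos | neg})
    models = []
    for bits in product([False, True], repeat=len(atoms)):
        m = {a for a, bit in zip(atoms, bits) if bit}
        # head <-> disjunction of its rule bodies, read classically
        if all((a in m) == any(pos <= m and not (neg & m)
                               for h, pos, neg in program if h == a)
               for a in atoms):
            models.append(m)
    return models


# Program P of Example 3.2 and the result A(P) of absorption.
P = [("p", {"r", "s"}, set()), ("q", {"r"}, set()), ("q", {"t"}, set()),
     ("s", set(), set()), ("t", set(), set())]
A_P = [("p", {"q", "s"}, set()), ("q", {"r"}, set()), ("q", {"t"}, set()),
       ("s", set(), set()), ("t", set(), set())]
```

The enumeration is exponential in the number of atoms, so it is only meant for checking toy examples such as the ones in this section.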

From a consistent program P, the V-operators may produce an inconsistent 
program V(P). 

Example 3.3. Let P1 be the program 

p ← q, ¬p,   q ← r,   s ← r,   s ←, 

which has the stable model {s}. Using the second rule and the third rule, 
absorption produces A(P1): 

p ← q, ¬p,   q ← s,   s ← r,   s ←, 

which has no stable model. Next, let P2 be the program 

p ← q, ¬p,   r ← q,   r ← s,   s ←, 

which has the stable model {r, s}. Using the second rule and the third rule, 
identification produces I(P2): 

p ← q, ¬p,   r ← q,   q ← s,   s ←, 

which has no stable model. 

The same problem happens under the completion semantics, e.g., 

comp(P1) = {p ↔ q, ¬p,   q ↔ r,   r ↔ false,   s ↔ r ∨ true} 

is consistent, while 

comp(A(P1)) = {p ↔ q, ¬p,   q ↔ s,   r ↔ false,   s ↔ r ∨ true} 

is inconsistent. Therefore, it is concluded that: 

Proposition 3.5. The V-operators may turn a consistent normal logic program 
into an inconsistent one under both the stable model semantics and the completion 
semantics. 

This problem does not arise in definite programs since a definite program is 
always consistent. A sufficient condition for guaranteeing the consistency of the 
produced program V(P) is given in the next section. 

Next we consider the use of the V-operators as generalization operators. We 
say that a program P1 generalizes a program P2 if P2 ⊨ a implies P1 ⊨ a for 
any atom a. The V-operators generalize a program P when P is definite, but 
this is not the case in the presence of negation as failure in general. 

Example 3.4. Let P1 be the program 

p ← ¬q,   q ← r,   s ← r,   s ←, 

which has the stable model {p, s}. Using the second rule and the third rule, 
absorption produces A(P1): 

p ← ¬q,   q ← s,   s ← r,   s ←, 

which has the stable model {q, s}. Then, P1 ⊨ p but A(P1) ⊭ p. 

Next, let P2 be the program 

p ← ¬r,   q ← r,   q ← s,   s ←, 

which has the stable model {p, q, s}. Using the second rule and the third rule, 
identification produces I(P2): 

p ← ¬r,   q ← r,   r ← s,   s ←, 

which has the stable model {q, r, s}. Then, P2 ⊨ p but I(P2) ⊭ p. 
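The stable models quoted in Examples 3.3 and 3.4 can be reproduced with a small brute-force checker based on the usual reduct construction. This is a sketch under assumptions, not the author's procedure: rules are (head, positive body, negative body) triples, and the names least_model and stable_models are hypothetical.

```python
from itertools import combinations


def least_model(definite_rules):
    """Least model of a ground definite program given as (head, body) pairs."""
    model, changed = set(), True
    while changed:
        changed = False
        for head, pos in definite_rules:
            if head not in model and pos <= model:
                model.add(head)
                changed = True
    return model


def stable_models(program):
    """All stable models of a finite ground normal program (brute force)."""
    atoms = sorted({h for h, _, _ in program}
                   | {a for _, pos, neg in program for a in pos | neg})
    found = []
    for k in range(len(atoms) + 1):
        for candidate in combinations(atoms, k):
            m = set(candidate)
            # Reduct: drop rules blocked by m, then drop negative literals.
            reduct = [(h, pos) for h, pos, neg in program if not (neg & m)]
            if least_model(reduct) == m:
                found.append(m)
    return found


# Example 3.4: P1 and A(P1).
P1 = [("p", set(), {"q"}), ("q", {"r"}, set()),
      ("s", {"r"}, set()), ("s", set(), set())]
A_P1 = [("p", set(), {"q"}), ("q", {"s"}, set()),
        ("s", {"r"}, set()), ("s", set(), set())]

# Example 3.3: A(P1) there has no stable model.
A_P1_ex33 = [("p", {"q"}, {"p"}), ("q", {"s"}, set()),
             ("s", {"r"}, set()), ("s", set(), set())]
```

In the first pair, p belongs to the only stable model of P1 but to no stable model of A(P1), which is the failure of generalization that the example describes.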

The same phenomenon is observed under the completion semantics. In non- 
monotonic theories, newly proven facts may block the derivation of other facts 
which are proven beforehand. As a result, the V-operators may not generalize 
the original program. Note that the above two programs Pi and P2 are (locally) 
stratified, which are the simplest extension of definite programs. 

Proposition 3.6. The V-operators do not generalize a normal logic program in 
general. This is the case even if a program is locally stratified. 
A simple condition for absorption (resp. identification) to generalize a normal 
logic program P is that for any negative literal ¬a in P, a does not depend on 
p of C3 (resp. q of C2) in 

3.2 Syntactic Properties 

In normal logic programs, syntactic restrictions on a program often introduce 
some nice properties in both the declarative and the procedural aspects. For 
instance, a locally stratified program always has a unique stable model, and an 
acyclic program guarantees termination of a top-down proof procedure. The- 
refore, when considering any program transformation, the preservation of such 
syntactic structures is particularly important to keep those nice properties in the 
transformed program. Unfortunately, the V-operators may destroy the structure 
of both acyclicity and local stratification. 

Example 3.5. Let P1 be the locally stratified (and also acyclic) program 

p ← q,   r ← q,   r ← ¬p, 

where p > q, r > q, r > p. Using the first rule and the second rule, absorption 
produces A(P1): 

p ← r,   r ← q,   r ← ¬p, 

where p > r, r > q, r > p. Here p > r conflicts with r > p, hence A(P1) is 
neither acyclic nor locally stratified. 

Next, let P2 be the locally stratified (and also acyclic) program 

p ← q, r,   p ← q, ¬s,   s ← r, 

where p > q, p > r, p > s, s > r. Using the first rule and the second rule, 
identification produces I(P2): 

p ← q, r,   r ← ¬s,   s ← r, 

where p > q, p > r, r > s, s > r. Here r > s conflicts with s > r, hence I(P2) is 
neither acyclic nor locally stratified. 
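For finite ground programs, both properties in Example 3.5 reduce to checks on the dependency graph. The sketch below rests on that assumption: a program is treated as acyclic when the graph has no cycle at all, and as locally stratified when no cycle passes through a negative edge. The function names and the rule encoding as (head, positive body, negative body) triples are illustrative choices, not taken from the paper.

```python
def dependency_edges(program):
    """Return (positive_edges, negative_edges) as sets of (head, body_atom)."""
    pos = {(h, a) for h, p, _ in program for a in p}
    neg = {(h, a) for h, _, n in program for a in n}
    return pos, neg


def reachable(start, edges):
    """Atoms reachable from start by following one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def is_acyclic(program):
    pos, neg = dependency_edges(program)
    edges = pos | neg
    return all(src not in reachable(src, edges) for src, _ in edges)


def is_locally_stratified(program):
    pos, neg = dependency_edges(program)
    edges = pos | neg
    # A negative edge h -> a lies on a cycle iff a can reach h again.
    return all(h not in reachable(a, edges) for h, a in neg)


# Example 3.5: P1, A(P1), P2 and I(P2).
P1 = [("p", {"q"}, set()), ("r", {"q"}, set()), ("r", set(), {"p"})]
A_P1 = [("p", {"r"}, set()), ("r", {"q"}, set()), ("r", set(), {"p"})]
P2 = [("p", {"q", "r"}, set()), ("p", {"q"}, {"s"}), ("s", {"r"}, set())]
I_P2 = [("p", {"q", "r"}, set()), ("r", set(), {"s"}), ("s", {"r"}, set())]
```

Running the checks before and after applying a V-operator gives a cheap way to detect when a transformation has destroyed the syntactic structure of a ground program.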

Proposition 3.7. Given a locally stratified (resp. acyclic) program P, V{P) is 
not locally stratified (resp. acyclic) in general. 

We give a sufficient condition for the V-operators to preserve such syntactic 
structures of the original program. 

Proposition 3.8. Let P be a locally stratified (resp. acyclic) program. 

* Here, depends on is a transitive relation defined as: p depends on q if there is a 
ground rule from P s.t. p appears in the head and q appears in the body of the rule. 
(i) Suppose that the rule C2 is produced from C1 and C3 by absorption of (2). 
If the relation p ≥ q (resp. p > q) holds in P, A(P) is also locally stratified 
(resp. acyclic). 

(ii) Suppose that the rule C1 is produced from C2 and C3 by identification 
of (3). For any positive literal ai in A and any negative literal ¬aj in A, if 
the relations q ≥ ai (resp. q > ai) and q > aj hold in P, I(P) is also locally 
stratified (resp. acyclic). 

Proof. (i) When P is locally stratified (resp. acyclic), for any positive literal bi 
in B of C3 and any negative literal ¬bj in B of C3, the relations p ≥ bi (resp. 
p > bi) and p > bj hold. In addition, the relation p ≥ q (resp. p > q) holds in P 
by assumption, so the rule C2 produced by absorption satisfies the condition 
of local stratification (resp. acyclicity). 

(ii) Since the relations q ≥ ai (resp. q > ai) and q > aj hold in P, the rule 
C1 produced by identification satisfies the condition of local stratification (resp. 
acyclicity). Hence, the result follows. □ 

The above proposition implies a sufficient condition which guarantees the 
consistency of V (P) for a locally stratified program P. 

Corollary 3.9 Let P be a locally stratified program. If P satisfies the condition 
(i) (resp. (ii)) of Proposition 3.8, A(P) (resp. I(P)) is consistent under both the 
stable model semantics and the completion semantics. 

Proof. When P satisfies the condition (i) (resp. (ii)), A{P) (resp. I{P)) is locally 
stratified. Since any locally stratified program is consistent under both the stable 
model semantics and the completion semantics, the result holds. □ 

4 Procedural Properties 

In the previous section, we observed that the V-operators may introduce cycles 
into a program. This means that given an acyclic program P, the completeness 
of SLDNF-resolution in the produced program V(P) is not guaranteed in general. 
The conditions of Proposition 3.8 are useful for keeping V(P) acyclic. This 
problem does not happen when a program is definite. When a definite program 
contains variables, however, an application of the V-operators may lose answers 
which are computed using SLD-resolution in the original program. 

Example 4.1. Let P1 be the program 

p(x, y) ← q(x, y),   r(y) ← q(x, y),   q(a, b) ← 

and A(P1) the program 

p(x, y) ← r(y),   r(y) ← q(x, y),   q(a, b) ←. 

Using SLD-resolution, the query ← p(x, b) computes the answer x = a in P1, 
but the answer is not obtained in A(P1). Next, let P2 be the program 

p(x, y) ← r(y),   p(x, y) ← q(x, y),   q(a, b) ← 






and I(P2) the program 

p(x, y) ← r(y),   r(y) ← q(x, y),   q(a, b) ←. 

Using SLD-resolution, the query ← p(x, b) computes the answer x = a in P2, 
but the answer is not obtained in I(P2). 

Thus, unrestricted application of the V-operators does not always extend the 
set of computed answers even in a definite program. Note that such phenomena 
happen only when the input rule C3 is included in the original program P in (2) 
and (3). When C3 is given aside from P, the relation P ⊆ V(P) holds, hence every 
computed answer in P is obtained in V(P). Hence, the following arguments are 
meaningful when C3 is in P. 

We first define the notion of generalization wrt computed answers. Given a 
definite program P and a goal ← G, we write P ⊨_SLD Gθ if the goal has an 
SLD-refutation with a computed answer θ. 

Definition 4.1. Let P1 and P2 be definite programs. Then, P1 is a generalization 
of P2 wrt computed answers if P2 ⊨_SLD Gθ implies P1 ⊨_SLD Gθ for any 
atom G. 

The next proposition presents sufficient conditions for the V-operators to 
generalize a program wrt computed answers. 

Proposition 4.1. Let P be a definite program. 

(i) A(P) is a generalization of P wrt computed answers if in the rule C1 : 
q ← A, every variable in A appears in q. 

(ii) I(P) is a generalization of P wrt computed answers if in the rule C2 : 
p ← q, B, every variable in p appears in either q or B. 
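The two variable conditions are purely syntactic and easy to test. The sketch below is illustrative only and makes assumptions that are not in the paper: atoms are (predicate, arguments) pairs, variables follow the Prolog convention of starting with an upper-case letter (so the paper's x and y become X and Y), and the function names are hypothetical.

```python
def variables(atoms):
    """Variables occurring in a list of atoms (upper-case initial = variable)."""
    return {arg for _, args in atoms for arg in args if arg[:1].isupper()}


def absorption_condition(q, a_body):
    """Condition (i): every variable in A appears in q, for C1: q <- A."""
    return variables(a_body) <= variables([q])


def identification_condition(p, q, b_body):
    """Condition (ii): every variable in p appears in q or B, for C2: p <- q, B."""
    return variables([p]) <= (variables([q]) | variables(b_body))


# Example 4.1, absorption: C1 is r(Y) <- q(X, Y); X does not occur in r(Y).
cond_i = absorption_condition(("r", ("Y",)), [("q", ("X", "Y"))])

# Example 4.1, identification: C2 is p(X, Y) <- r(Y) with empty B.
cond_ii = identification_condition(("p", ("X", "Y")), ("r", ("Y",)), [])

# A rule where condition (ii) holds: p(X, Y) <- q(X, Y), s(Y).
cond_ii_ok = identification_condition(
    ("p", ("X", "Y")), ("q", ("X", "Y")), [("s", ("Y",))])
```

Both conditions fail on the rules of Example 4.1, which is consistent with the answers lost there.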

Proof. (i) Without loss of generality, we can put C1 : q(x, y) ← A(x) and 
C3 : p(z) ← A(x), B(w), where x, y, z, w are vectors of terms. Suppose that 
C2 : p(z) ← q(x, y), B(w) is produced by absorption. Resolving C2 with C1, we 
get the original rule C3 in A(P). Hence, if a query has a computed answer in P, 
the same answer is computed by SLD-resolution in A(P). 

(ii) Put C2 : p(x, y) ← q(x, z), B(y, w) and C3 : p(x, y) ← A(u), B(y, w), 
where x, y, z, w, u are vectors of terms. Suppose that C1 : q(x, z) ← A(u) is 
produced by identification. Resolving C2 with C1, we get the rule C3' : p(x, y) ← 
A(u'), B(y, w) where u' is possibly different from u. For variables in x, two cases 
are considered. (a) When x and u share no variable in C3, x and u' also share 
no variable in C3'. (b) When x and u share variables in C3, the same variables 
are shared in x and u in C1 and thereby shared in x and u' in C3'. In either 
case, any binding for the variables in x that is computed using C3 in P is also 
computed using C3' in I(P). Next, for variables in y, two cases are considered. 
(c) When y and u share no variable in C3, y and u' also share no variable in C3'. 




® We assume familiarity with basic terminologies wrt SLD-resolution given in 0. 
(d) When y and u share variables in C3, the variables are possibly renamed in 
u' in C3' but any binding for the variables in y is computed by B(y, w) in C3'. In 
either case, any binding for the variables in y that is computed using C3 in P 
is also computed using C3' in I(P). Therefore, if a query has a computed answer 
in P, the same answer is computed by SLD-resolution in I(P). □ 

The V-operators do not preserve the condition of allowedness^ in general. 
For instance, the program P1 of Example 4.1 is allowed but A(P1) is not. In 
normal logic programs, the condition of allowedness provides a sufficient 
condition to prevent a query from floundering [^. Thus, given an allowed normal 
logic program P, a query may flounder in V(P). The condition (i) of 
Proposition 4.1 prevents the decrease of variables in the body of C3, hence it guarantees 
that A(P) is allowed if P is allowed. On the other hand, the condition (ii) is 
insufficient to keep I(P) allowed. Suppose that a program P has the rules 

C2 : p(x, y) ← q(x, y), s(y),   C3 : p(x, y) ← r(x), s(y), 

which are allowed and C2 satisfies the condition (ii) of Proposition 4.1. But 
identification produces C1 : q(x, y) ← r(x) which is not allowed. Then, the query 
← q(x, y), ¬t(x, y) may flounder in I(P). To keep I(P) allowed, it is sufficient 
that every variable in q of C2 appears in a positive literal in A of C3 in P. 

5 Discussion 

There are variants of the V-operators. Muggleton introduces the most specific 
version of the V-operators. The most specific absorption produces the rule C2' : 
p ← q, A, B, instead of C2 : p ← q, B, while the most specific identification 
produces the rule C1' : q ← A, B, instead of C1 : q ← A. The most specific 
absorption is also called saturation in UH. Saturation alone does not generalize a 
program and is followed by truncation. Truncation drops the conjunction A from 
C'2, which results in C2. Coupling saturation and truncation includes absorption 
as a special case, hence they have the same properties as those of absorption 
presented in this paper. It is easy to see that the properties of identification 
presented in this paper also hold for the most specific identification. The W- 
operators of include the V-operators as a special case, hence they also inherit 
the properties of the V-operators. 

Few works consider inverse resolution in normal logic programs. 
Bain and Muggleton m incorporate the closed world specialization 
technique into CIGOL, but they do not provide a formal analysis of inverse resolution 
in nonmonotonic theories. Taylor P introduces normal absorption, which is 
different from absorption in definite programs. Given the input rules p ← q and 
r ← q, ¬s, normal absorption outputs the rule p ← r, [q, ¬s] where [q, ¬s] 
represents optional literals which can be dropped at the end. She shows that the 
output rule generalizes the input rule wrt the background theory under normal 
subsumption. However, she does not discuss the effect of such normal V-operators 
on a whole theory. 



Any variable in a rule has an occurrence in a positive literal in the body of the rule. 
6 Summary 

This paper studied the effect of inverse resolution in normal logic programs. We 
identified several problems of the V-operators that may occur in the presence of 
negation as failure, and gave some sufficient conditions to avoid these problems. 
The results of this paper indicate that care should be taken when using the 
V-operators as inductive operations in normal logic programs. 
Abstract. The Predictive Toxicology Evaluation (or PTE) Challenge 
provided Machine Learning techniques with the opportunity to compete 
against specialised techniques for toxicology prediction. Toxicity models 
that used findings from ILP programs have performed creditably in the 
PTE-2 experiment proposed under this challenge. We report here on an 
assessment of such models along scales of: (1) quantitative performance, 
in comparison to models developed with expert collaboration; and (2) 
potential explanatory value for toxicology. Results appear to suggest the 
following: (a) across a range of class distributions and error costs, some 
explicit models constructed with ILP-assistance appear closer to opti- 
mal than most expert-assisted ones. Given the paucity of test-data, this 
is to be interpreted cautiously; (b) a combined use of propositional and 
ILP techniques appears to yield models that contain unusual combina- 
tions of structural and biological features; and (c) significant effort was 
required to interpret the output, strongly indicating the need to invest 
greater effort in transforming the output into a “toxicologist-friendly” 
form. Based on the lessons learnt from these results, we propose a new 
predictive toxicology evaluation experiment - PTE-3 - which will address 
some important shortcomings of the previous study. 



1 Introduction 

Hypothesising “good” models from data is at once one of the most routine and 
challenging of scientific activities. This task assumes a degree of urgency when 
the models are directly concerned with issues of public health and safety. The 
prediction of chemical toxicity provides a case in point. It is estimated that 
approximately 100,000 chemicals are in routine use daily, with 500-1000 new 
chemicals being introduced yearly [8]. Approximately 300 chemical studies (involving 
standardised bioassays for toxicity) are commenced worldwide each year, 
with each study taking at least five years to complete. The obvious gulf engen- 
dered between the rate of growth of chemical data and chemical knowledge 
has turned attention to machine-assisted methods of data analysis. The PTE 

* Two figures in the paper and some of the discussion in Section 4 appear in part in a 
submission to a AAAI Spring Symposium on Predictive Toxicology [16] and to the 
Sixteenth International Conference on Artificial Intelligence (IJCAI-99, [17]). 



S. Dzeroski and P. Flach (Eds.): ILP-99, LNAI 1634, pp. 291-302, 1999. 
© Springer-Verlag Berlin Heidelberg 1999 
Challenge [18] follows the lead of a comparative evaluation exercise undertaken 
earlier by the National Institute of Environmental Health Sciences (NIEHS, see: 
dir.niehs.nih.gov/dirlecm/pte2.htm and [5]). The challenge described an 
experiment PTE-2, in which carcinogenesis predictions for 30 compounds were to 
be made by models constructed by Machine Learning programs^. This paper 
is concerned with these models. In particular, our focus is on “explicit” mod- 
els - those capable of examination for toxicological insights - constructed with 
ILP assistance. We examine the quantitative performance of such ILP-assisted 
models and provide some assessment of their chemical value. We further provide 
details of a new experiment (PTE-3) that addresses many of the shortcomings of 
the PTE-2 experiment. The paper is organised as follows: Section 2 summarises 
submissions made to the challenge. Section 3 compares quantitatively the ex- 
plicit ILP-assisted models against those developed under the guidance of expert 
toxicologists (this includes toxicology expert systems) . Section 4 contains an ap- 
praisal of the explanatory value of the ILP-assisted models. Section 5 contains 
the proposal for PTE-3, and Section 6 concludes this paper. 

2 Summary of submissions to the PTE Challenge 

All submissions made to the challenge consisted of two parts: (1) prediction: 
“pos” and “neg” classification for the compounds in PTE-2 (standing for car- 
cinogenic or otherwise: see [5] for a further description of the meaning of these 
classes); and (2) description: details of the materials and methods used, and 
results obtained with the technique. The former was needed to assess model 
accuracy, and the latter for replicability of results and evaluations of model 
comprehensibility by a toxicologist. Nine legal submissions^ were received in the 
period between August 29, 1997 and November 15, 1998. These are summarised 
in Figure 1. 



3 Quantitative Assessment of Models 

At the time of writing this paper, the classification of 23 of the 30 compounds 
had become available. Figure 2 tabulates the predictive accuracies achieved by 
the models tabulated in Figure 1. 

The primary focus of this paper precludes any further analysis of the models 
OAI and TA1. LRD also poses some concern, as it is still unclear whether 
any ILP assistance was required. No single model is presented as part of the 
description, although the developers appear confident that such a model can 
be obtained^. For comparative purposes, Benigni [2] provides a tabulation of 
the predictions made by several established toxicity prediction methods on a 
^ All Internet sites mentioned in this paper are to be prefixed with http:// 

^ Submissions were received at: www.comlab.ox.ac.uk/oucl/groups/machlearn/PTE 
® Those that contained both “prediction” and “description” parts. 

* M. Sebag, private communication. 
Model   Uses ILP   Description 
LE1     ✓          www.cs.kuleuven.ac.be/~hendrik/PTE/PTE1.html 
LE2     ✓          www.cs.kuleuven.ac.be/~ldh/PTE/PTE2.html 
LE3     ✓          www.cs.kuleuven.ac.be/~wimv/PTE/PTE2.html 
LRD     ?          www.lri.fr/~fabien/PTE/Distill/ 
LRG     ✓          www.lri.fr/~fabien/PTE/GloBo/ 
OAI     ×          www.ai.univie.ac.at/~bernhard/pte2/pte2.html 
OU1     ✓          www.comlab.ox.ac.uk/oucl/groups/machlearn/PTE/oucl1.html 
OU2     ✓          www.comlab.ox.ac.uk/oucl/groups/machlearn/PTE/oucl2.html 
TA1     ×          ailab2.cs.nthu.edu.tw/pte 

Fig. 1. Models submitted to the PTE Challenge. An entry of ✓ under “Uses ILP” 
indicates that the model uses results from an ILP program; × that it does not use 
results from an ILP program; and ? that it is unclear whether ILP results are used. 
The models were constructed as follows: LE1 by the ILP program Tilde; LE2 by the 
maximum likelihood technique MACCENT using the results from the ILP program 
WARMR; LE3 by the ILP program ICL; LRD by a stochastic voting technique; LRG 
by a stochastic technique that uses, amongst others, results from WARMR and 
the ILP program P-Progol; OAI by voting with Naive Bayes and the decision-tree 
learner C4.5; OU1 by C4.5 and P-Progol; OU2 by C4.5 using, amongst others, results 
from WARMR; and TA1 using a genetic search technique. In constructing LRD, the 
stochastic technique has access to results from WARMR and P-Progol. However, as 
there is no single model associated with the output, it is unclear whether the ILP 
results were used. 



subset of the PTE-2 compounds. We concentrate here on those techniques that 
involve substantial input from experts. These include models devised directly 
by toxicologists or those that rely on the application of compilations of such 
specialist knowledge (that is, toxicity expert systems). In [2], there are 9 such 
“expert-derived” models due to: Huff et al. (HUF, [7]), OncoLogic (ONC, [20]), 
Bootman (BOT, [3]), Tennant et al. (TEN, [19]), Ashby (ASH, [1]), Benigni et al. 
(BEN, [14]), Purdy (PUR, [13]), DEREK (DER, [10]), and COMPACT (COM, 
[9]). Excluding missing entries, predictions are available from these methods 
for 18 PTE-2 compounds. A comparative tabulation on this subset against the 
ILP-assisted models is in Figure 3. 

Comparisons based on predictive accuracy overlook two important practical 
concerns, namely (a) that class distributions cannot be specified precisely, so the 
distribution of classes in the training set is rarely matched exactly on new data; 
and (b) that the costs of different types of errors may be unequal. In toxicology 
modelling the cost of false negatives is usually higher than that of false 
positives. Using techniques developed in signal detection, the authors in [11] describe 
an elegant method for the comparative assessment of classifiers that takes these 
considerations into account. In summary, they describe a technique for 
eliminating classifiers that could not possibly be “optimal” under any circumstances. 
The details relevant to the two-class problem addressed here are as follows: 
Model   Accuracy 
LRD     0.87 (0.07) 
LRG     0.78 (0.09) 
OU2     0.78 (0.09) 
OAI     0.74 (0.09) 
LE3     0.70 (0.10) 
LE2     0.65 (0.10) 
OU1     0.57 (0.10) 
TA1     0.52 (0.10) 
LE1     0.48 (0.10) 
POS     0.74 (0.10) 

Fig. 2. Estimated accuracies of submissions made to the PTE Challenge. Here, accu- 
racy refers to the fraction of PTE-2 compounds correctly classified by the model. The 
quantity in parentheses next to the accuracy figure is the estimated standard error. 
The classifications are based on the outcome of 23 of the 30 PTE-2 bioassays. The 
classification of the remaining 7 is yet to be decided. “POS” refers to the simple rule that 
states that all compounds will be carcinogenic. This was not an official submission to 
the challenge and is only included here for completeness. 
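The paper does not say how the standard errors were estimated. Most of the tabulated values appear consistent with the usual binomial estimate sqrt(p(1 − p)/n) with n = 23, so the sketch below uses that formula as an assumption; the function name is hypothetical.

```python
import math


def standard_error(accuracy, n):
    """Binomial standard error of an accuracy estimated on n test cases.

    Assumption: this is the usual sqrt(p * (1 - p) / n) estimate; the paper
    does not state which estimator was actually used.
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Accuracies of LRD, LRG and LE3 on the 23 classified compounds.
se_lrd = round(standard_error(0.87, 23), 2)
se_lrg = round(standard_error(0.78, 23), 2)
se_le3 = round(standard_error(0.70, 23), 2)
```

With only 23 test compounds the standard errors are close to 0.1, which is why differences of a few percentage points between models should be read cautiously.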



1. Let the two classes be denoted + and − respectively. Let π(+) and π(−) = 
1 − π(+) be the prior probabilities of the classes. Suppose we have unbiased 
estimates for the following: TP, the proportion of instances observed to be 
+ and classified as such; and FP, the proportion of instances observed to be 
− and classified as +. Using the notation in [4], let the costs of false positives 
and false negatives be C(+|−) and C(−|+) respectively (that is, the cost of 
classifying an instance as + when it is really a −, and vice versa). 

2. The expected misclassification cost of a classifier is then given by 
π(+) · (1 − TP) · C(−|+) + π(−) · FP · C(+|−). For brevity, “expected misclassification 
cost” will henceforth be simply called “cost”. It is easy to see that two 
classifiers (FP1, TP1) and (FP2, TP2) have the same cost if 
(TP2 − TP1)/(FP2 − FP1) = π(−)C(+|−) / (π(+)C(−|+)) = m. 

3. The “operating characteristic” of a binary classifier can be represented as 
a point in the two-dimensional Cartesian space (called “Receiver Operating 
Characteristic” or ROC space) defined by FP on the X axis and TP on 
the Y axis. Classifiers with continuous output are represented by a set of 
points obtained by thresholding the output value (each threshold resulting 
in a classification of the instances into one of + or −). A set of points may 
also be obtained by varying critical parameters in the binary classification 
technique, with each setting resulting in a binary classifier. 

4. A specification of π and C defines a family of lines with slope m (as defined 
in item 2 above) in ROC-space. All classifiers on a given line have the same 
cost, and the lines are called iso-performance lines. Lines with a higher TP 
intercept represent classifiers with lower cost (follows from the cost formula 
Model   Accuracy 
LRD     0.89 (0.07) 
LRG     0.84 (0.09) 
HUF     0.78 (0.10) 
LE3     0.78 (0.10) 
ONC     0.78 (0.10) 
OU2     0.78 (0.10) 
LE2     0.72 (0.11) 
BEN     0.67 (0.11) 
OU1     0.67 (0.11) 
ASH     0.56 (0.12) 
LE1     0.56 (0.12) 
TEN     0.56 (0.12) 
BOT     0.50 (0.12) 
COM     0.50 (0.12) 
DER     0.50 (0.12) 
PUR     0.28 (0.11) 
POS     0.67 (0.11) 

Fig. 3. Comparison of estimated accuracies of ILP-assisted models and expert-derived 
models. As before, estimates of standard errors are in parentheses. ILP-assisted models 
are in bold-face. Although unclear at this stage, we have included LRD in this list. The 
figures are based on the classification of 18 of the 30 PTE-2 compounds for which 
predictions are available from all models. Some expert-derived models include a third 
category of classification called “borderline carcinogen.” These are simply taken as a 
“pos” classification here. As before POS predicts all outcomes as “pos” . 



in item 2). Imprecise specifications of π and C will give rise to a range of 
possible m values. 

5. Minimum cost classifiers lie on the edge of the convex hull of the set of 
points in item 3 above. For a given value of m = m1, potentially optimal 
classifiers occur at points in ROC-space where the slope of the hull-edge is 
m1 or at the intersection of edges whose slopes are less than and greater 
than m1 respectively (the proof of this is in [12]). If operating under a 
range of m values (say [m1, m2]), then potentially optimal classifiers will lie 
on a segment of the hull-edge (see Figure 4). Henceforth we will call such 
classifiers “FAPP-optimal” (to denote optimal for all practical purposes). 
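The five steps above can be sketched in Python as follows. This is a minimal illustration and not the ROCCH program of [11]: the function names are hypothetical, the sample ROC points are invented for the example, and the slope is taken to be π(−)C(+|−) / (π(+)C(−|+)), which is what equating the costs of two classifiers in item 2 gives. Minimising cost for a fixed slope m is then equivalent to maximising TP − m · FP over the hull.

```python
def expected_cost(tp, fp, prior_pos, cost_fn, cost_fp):
    """pi(+) * (1 - TP) * C(-|+) + pi(-) * FP * C(+|-)."""
    return prior_pos * (1 - tp) * cost_fn + (1 - prior_pos) * fp * cost_fp


def iso_slope(prior_pos, cost_fn, cost_fp):
    """Slope m of the iso-performance lines in ROC space."""
    return ((1 - prior_pos) * cost_fp) / (prior_pos * cost_fn)


def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(points):
    """Upper convex hull of (FP, TP) points, including (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Keep only right turns so that the chain stays concave from above.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def best_for_slope(points, m):
    """Hull vertex with minimum cost for slope m (maximises TP - m * FP)."""
    return max(upper_hull(points), key=lambda p: p[1] - m * p[0])


# Hypothetical ROC points (FP, TP), for illustration only.
points = [(0.1, 0.6), (0.3, 0.9), (0.5, 0.7), (0.6, 0.95)]
hull = upper_hull(points)
best = best_for_slope(points, 1.0)
```

In this invented example the point (0.5, 0.7) falls below the hull and so can never be optimal for any slope, which is the elimination step described in item 5. Evaluating best_for_slope at both ends of a slope range gives the end points of the hull segment that contains the FAPP-optimal classifiers.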

This procedure for obtaining FAPP-optimal classifiers has not been extended by 
the authors in [11] to classifiers that discriminate between more than 2 classes. In 
fact, the result that optimal classifiers are located on line segments joining vertices 
of the convex hull holds for an arbitrary number of classes for any cost function that 
is linear in the Euclidean space representing the classifiers (see [15]). With these 
preliminaries in place, we are in a position to examine the ROC plot of the 
models in Figure 3. This is shown in Figure 5. 



Fig. 4. Classifiers in ROC-space. Here A and C are continuous-output classifiers 
(represented by curves) and B, D, and E are binary classifiers (represented by points). 
The edge of the convex hull is the piecewise-linear curve separating the shaded area 
from the unshaded one. Potentially optimal classifiers lie on this edge and are found 
by comparing the slope of a linear segment comprising the edge against the value m 
determined by the current specification of priors and costs. Thus for m = m1, C is the 
only classifier that is potentially optimal. Imprecise specification of these will result in 
a range of values and potentially optimal classifiers then lie on a segment of the hull. 
Thus for m ∈ [m1, m2], potentially optimal classifiers lie along the thickened line 
segment (A, B, C are thus candidates). D and E can never be optimal for any value of 
m. A theoretically optimal classifier for any value of m would have a “step” ROC-curve 
joining the points (0,0), (0,1) and (1,1). 



The graph would appear to suggest that, with the exception of HUF, the 
expert-assisted models are not FAPP-optimal. However, given that the plot is based 
on a very small sample of 18 points, we prefer to interpret these results as 
suggesting that the ILP-assisted models are closer to FAPP-optimal than their 
expert-assisted counterparts. This in itself is unexpected, and worthy of further 
investigation. 

The authors of [11] provide a computer program ROCCH (www.croftj.net/fawcett/ROCCH/) 
to calculate the hull and identify FAPP-optimal classifiers. 
The output of this program is summarised in Figure 6. 

Fig. 5. ROC plot of ILP-assisted and expert models. It is unclear whether LRD should 
be included in this plot. The convex hull with LRD is given by the broken line, and 
without, by the unbroken line. NEG refers to a model that predicts all outcomes will 
be non-carcinogenic. 

It is of interest to examine some representative values for the slope. The 
data available suggest that the prior probability of the “pos” class, π(+), is 
approximately in the range 0.5-0.7. Further, the cost of false negatives C(−|+) 
distinctly outweighs that of false positives C(+|−). Conventions vary, but a 
weighting factor between 10 and 20 would not be uncommon. This specifies the 
slope range [0.02, 0.1]. For this range LRG would appear to be FAPP-optimal. 
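
The arithmetic behind this range can be checked with a short sketch. It assumes that the slope takes the usual iso-performance form m = π(−)C(+|−) / (π(+)C(−|+)) with π(−) = 1 − π(+), and that the "weighting factor" is the ratio C(−|+)/C(+|−); both are assumptions about the definition referred to in item 2, and the function names are hypothetical. Under these assumptions the sketch reproduces the quoted range to two decimal places, and the range falls inside the LRG intervals listed in Figure 6.

```python
def slope(prior_pos: float, cost_fn: float, cost_fp: float = 1.0) -> float:
    """m = pi(-) * C(+|-) / (pi(+) * C(-|+)), with pi(-) = 1 - pi(+)."""
    return (1.0 - prior_pos) * cost_fp / (prior_pos * cost_fn)


def slope_range(priors, fn_weights):
    """Smallest and largest slope over intervals of pi(+) and C(-|+)/C(+|-)."""
    # The slope falls as pi(+) or the false-negative weighting grows, so the
    # extremes are reached at opposite corners of the two intervals.
    low = slope(max(priors), max(fn_weights))
    high = slope(min(priors), min(fn_weights))
    return low, high


low, high = slope_range(priors=(0.5, 0.7), fn_weights=(10, 20))
print(f"slope range: [{low:.2f}, {high:.2f}]")

# Upper bounds of the LRG intervals as listed in Figure 6:
# (a) with LRD and (b) without LRD. Both intervals start just above 0.
lrg_upper_bounds = {"with LRD": 0.242, "without LRD": 1.000}
for label, upper in lrg_upper_bounds.items():
    print(label, 0.0 < low and high <= upper)
```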

4 Qualitative Assessment of Models 

At the outset of this section, it is worth emphasising that as submitted, none 
of the ILP-assisted models would be considered toxicologically acceptable. This 
comment extends even to the most transparent submissions, such as OU2, which 
presents a relatively simple decision-tree obtained from a well-known algorithm 
(C4.5). Much of this probably stems from a lack of toxicology expertise amongst 
the model developers. We intend to correct this partially by stipulating minimal 
requirements on output descriptions in the next round of experimentation 
(PTE-3: see Section 5). Nevertheless, the quantitative performance of the models 
has been sufficiently intriguing to foster further examination. It is not our 
intention to single out any one model as being the “best” - the toxicological 
shortcomings mentioned would make such statements meaningless. Rather, we 
provide an overall assessment of the type of constructs identified by the models. 

Slope Range        Best Classifier(s) 
[0.000, 0.000]     HUF, POS 
(0.000, 0.242]     LRG 
(0.242, 5.412]     LRD 
(5.412, ∞)         NEG 

(a) 

Slope Range        Best Classifier(s) 
[0.000, 0.000]     HUF, POS 
(0.000, 1.000]     LRG 
(1.000, 1.562]     LE3 
(1.562, 3.412]     GUI 
(3.412, ∞)         NEG 

(b) 

Fig. 6. Summary of the output of ROCCH. Recommendations of the best classifier are 
obtained by determining the ranges of slopes for which a classifier on the hull is on the 
optimal segment. The results are with (a) and without (b) LRD. 

Of considerable toxicological interest is the frequent appearance in all models 
of rules that combine chemical structure and biological tests. For some time, 
there has been vigorous debate on how classical structure-activity modelling can 
be applied to toxicity problems. This form of modelling relates chemical features 
to activity, and works well in vitro. The extent to which these ideas transfer to 
toxicity modelling - which deals with the interaction of chemical factors with 
biological systems - is not evident. By using a combination of chemical features 
and biological test outcomes, the ILP-assisted models provide one possible 
method for dealing with the chemical effects in such “open” systems. If the 
accuracies obtained with such rules are borne out on larger datasets, then this would 
constitute a significant advance in structure-activity modelling for toxicology. 

A number of aspects of some of the models are in line with what is currently 
known in toxicology. As an example, OU2 selects a combination of mouse 
lymphoma and Drosophila tests as a strong indicator of carcinogenicity. Many 
toxicologists believe that relationships exist between genotoxicity and 
carcinogenicity. While the only accepted correlation involves the Salmonella assay, 
this rule suggests that a different combination of short-term tests could be equally 
or more effective. Similar comments could be made on a number of other fronts: 
methoxy groups, sulphur compounds, and biphenyl groups are 
all identified in various ways as being related to toxicity. Far from being 
uninteresting, identification of these well-known aspects is essential, as it serves 
to reinforce a toxicologist’s confidence in a model. 

An interesting feature to arise is the re-use of ILP-constructed results by 
other prediction programs. Three ways of doing this are evident: 

A. Incorporate other prediction methods as part of the background knowledge 
for an ILP program. None of the models here were developed in this manner. 

B. Incorporate the results from an ILP program into an established prediction 
method. LRG, LE2 and OU2 were constructed in this manner, where the 
ILP results were incorporated as new features. 

C. Use ILP to explain only those instances that are inadequately modelled by 
established techniques†. OU1 was constructed in this manner, with an ILP 
program constructing a “theory of exceptions” to a simple C4.5 rule model. 

Based on the results here, Method B appears to hold the greatest promise for 
toxicology modelling. The ILP program WARMR appears to be particularly 
well-suited to the task of identifying sub-structures that can constitute features for 
a propositional technique [6]. 

† This role for ILP was first brought to our attention by Donald Michie. 

5 The PTE-3 Experiment 

The experiment of predicting the carcinogenicity outcome of compounds in PTE-2 
had the following shortcomings: 

Classification. The simplistic classification into two classes gives the impres- 
sion that toxicity is a property possessed by individual chemicals and that 
it is independent of the multi-factorial, dynamical systems that are used to 
conduct activity-classification experiments. Other classification schemes are 
both desirable and needed. For example, trans-species and gender-specific 
effects sometimes dominate the results obtained from an assay for carcino- 
genesis. 

Data. The size of the PTE-2 data set is too small to obtain reliable statistics 
of performance. Further, the knowledge of chemicals in the test-set compro- 
mised the possibility of a true blind trial. 

Evaluation. There were no specifications provided for the output descriptions. 
This led to model developers presenting their results in diverse ways, none of 
which were particularly meaningful to a toxicologist. This greatly impeded 
a careful assessment of explanatory value. 

Dissemination. No clear directions were provided on methods of publicising 
the results obtained. 

We have attempted to remedy each of these in the proposal of a new experiment, 
PTE-3. It is envisaged that PTE-3 will run from July 1999 to July 2000, and 
have the following attributes‡: 

Classification. Predictive models will be required for the following categories: 
(a) carcinogenicity outcome into 2 classes as before; and (b) gender (male or 
female) and species (rat or mouse) specific levels of evidence in 4 categories 
(clear evidence, some evidence, equivocal, and no evidence). 

Data. It is our intention to increase the size of the test set to at least 50 
chemicals. If feasible, we will also increase the size of the training set. The test 
set will not be advertised prior to the closing date for submissions. 

Evaluation. We intend to stipulate minimal requirements on output descriptions. 
These will contain at least the following: (a) contingency table from 
a 10-fold cross-validation on the training set; (b) estimates of accuracy, true 
and false positive rates using (a); (c) the number of training cases, and the actual 
cases, covered by each component of the model (a “component” can be, for example, 
a rule in the model). Once the test set is released, these and the estimates 
in (b) have to be provided for this data as well; and (d) complete English 
translations of any special constructs used in the model (such constructs 
could be, for example, the substructure used in a rule). Models that do not 
meet the requirements will be discarded. Evaluation on the rest will proceed 
along quantitative and qualitative scales as before, and is expected to be 
completed by October 2000. 

Dissemination. If the results are sufficiently interesting, we intend to request 
space for a special issue from a toxicology journal. Model developers would 
then be encouraged to submit papers to such an issue. 

‡ Complete details of the Internet site etc. should be available by ILP’99. 
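
The estimates requested under Evaluation, items (a) and (b), follow directly from a contingency table. The sketch below is an illustration only: the class and function names are hypothetical, the counts are invented, and pooling the ten per-fold tables by summation is one common convention that the requirements above do not prescribe.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Contingency:
    tp: int  # positive cases predicted positive
    fn: int  # positive cases predicted negative
    fp: int  # negative cases predicted positive
    tn: int  # negative cases predicted negative

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fn + self.fp + self.tn)

    def tp_rate(self) -> float:
        return self.tp / (self.tp + self.fn)

    def fp_rate(self) -> float:
        return self.fp / (self.fp + self.tn)


def pool(folds: Iterable[Contingency]) -> Contingency:
    """Sum the per-fold tables of a cross-validation into one table."""
    total = Contingency(0, 0, 0, 0)
    for f in folds:
        total = Contingency(total.tp + f.tp, total.fn + f.fn,
                            total.fp + f.fp, total.tn + f.tn)
    return total


# Hypothetical counts for ten folds; these are not PTE results.
folds = [Contingency(tp=3, fn=1, fp=1, tn=5) for _ in range(10)]
table = pool(folds)
print(table, table.accuracy(), table.tp_rate(), table.fp_rate())
```

The pair (FP rate, TP rate) obtained in this way is the point at which a model would be plotted in ROC-space.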

6 Concluding Remarks 

This paper has presented an assessment of ILP-assisted models submitted as 
part of a toxicology prediction experiment. The conclusions that can be drawn 
for toxicology modelling are these: (a) ILP-assisted models have performed 
unexpectedly well on scales of quantitative performance; (b) the techniques 
certainly merit further investigation; and (c) greater attention must be paid to 
providing model-developers with toxicological requirements in order for the 
output of such techniques to be deemed “comprehensible”. The new experiment 
proposed is designed to provide ILP techniques with an opportunity to build 
on the promising results here, and to take the first steps towards a truly effective 
assistant to an expert toxicologist. 